I'm building a web app. This app will use MySQL to store all the information associated with each user. However, it will also use MySQL to store sys admin type stuff like error logs, event logs, various temporary tokens, etc. This second set of information will probably be larger than the first set, and it's not as important. If I lost all my error logs, the site would go on without a hiccup.
I am torn on whether to have multiple databases for these different types of information, or stuff it all into a single database, in multiple tables.
The reason to keep it all in one, is that I only have to open up one connection. I've noticed a measurable time penalty for connection opening, particularly using remote mysql servers.
What do you guys do?
Fisrt,i must say, i think storing all your event logs, error logs in db is a very bad idea, instead you may want to store them on the filesystem.
You will only need error logs or event logs if something in your web app goes unexpected. Then you download the file, and examine it, thats all. No need to store it on the db. It will slow down your db and your web app.
As an answer to your question, if you really want to do that, you should seperate them, and you should find a way to keep your page running even your event og and error log databases are loaded and responding slowly.
Going with two distinct database (one for your application's "core" data, and another one for "technical" data) might not be a bad idea, at least if you expect your application to have a lot of users :
it'll allow you to put one DB on one server, and the other DB on a second server
and you can think about scaling a bit more, later : more servers for the "core" data, and still only one for the "technical" data -- or the opposite
if the "technical" data is not as important, you can (more easily) have two distinct backup processes / policies
having two distinct databases, and two distinct servers, also means you can have heavy calculations on the technical data, without impacting the DB server that hosts the "core" data -- and those calculations can be useful, on logs, or stuff like that.
as a sidenote : if you don't need that kind of "reporting" calculations, maybe storing those data to a DB is not useful, and files would do perfectly ?
Maybe opening two connections means a bit more time -- but that difference is probably rather negligible, is it not ?
I've worked a couple of times on applications that would use two database :
One "master" / "write" database, that would be used only for writes
and one "slave" database (a replication of the first one, to several slave servers), that would be used for reads
This way, yes, we sometimes open two connections -- bu one server alone would not have been able to handle the load...
Use connection pooling anyway. So the time to get a connection is not a problem. But if you have 2 connections, transaction handling become more complicated. On the other hand, sometimes it's handy to have 2 connections: if something goes wrong on the business transaction, you can rollback transaction and still log the failure on the admin transaction. But I would still stick to one database.
I would only use one databse - mostly for the reason you supply: You only need one connection to reach both logging and user stored data.
Depending on your programming language, some frameworks (J2EE as an example) provide connection pooling. With two databases you would need two pools. In PHP on the other hand, the performance come in to perspective when setting up a connection (or two).
I see no reason for two databases. It'd be perfectly acceptable to have tables that are devoted to "technical" and "business"data, but the logical separation should be sufficient.
Physical separation doesn't seem necessary to me, unless you mean an application and data warehouse star schema. In that case, it's either real-time updates or, more typically, a nightly batch ETL.
It makes no difference to mysql in any way whether you use separate "datbases", they are simply catalogues.
It may make setting permissions easier, this is a legitimate reason to do it. Other than that, it is exactly the same as keeping the tables in the same db (except you can have several tables with the same name ... but please don't)
Putting them on separate servers might be a good idea however, as you probably don't want your core critical (user info, for example) data mixed in with your high-volume, unimportant data. This is particularly true for old audit data, debug logs etc.
Also short-lived data, such as search results, sessions etc, could be placed on a different server - it presumably has no high availability[1] requirement.
Having said that, if you don't need to do this, dump it all on one server where it's easier to manage (backup, provide high availibilty, manage security etc).
It is not generally possible to take a consistent snapshot of data on >1 server. This is a good reason to only have one (or one that you care about for backup purposes)
[1] Of the data, not the database.
In MySQL, InnoDB has an option of storing all tables of a certain database in one file, or having one file per table.
Having one file per table is somewhat recommended anyway, and if you do that, it makes difference on the database storage level if you have one database or several.
With connection pooling, one database or several is probably not going to matter either.
So, in my opinion, the question is if you'd ever consider separating the "other half" of the database to a separate server - with the separate server having perhaps a very different hardware configuration, such as no RAID. If so, consider using separate databases. If not, use a single database.
Related
This is the first time I have worked with a read replica so please bare with me. I tried multiple searches on here and google to no avail.
I know how a primary database instance and a read replica on RDS. Replication is working great, I can connect to both instances with no problem. My question is two fold.
Do I need to update my API models so that read only operations are directed to the replica?
Are there any tips to get the best performance from this setup? Can I adjust indexes on the master so that write commands have less indexes than tables I need for read? Lastly, some of my models read a dataset, return the value, then log the activity in an audit log table. Would I need to split this functionality out so the logging activity occurs on the main write database?
Appreciate any guidance here.
It depends. The replica will always be slightly behind its source instance, so if you have queries that have a requirement to read current data always, they need to read from the source instance.
Different queries even in your single app will have different needs in that respect. So you really need to decide on a case-by-case basis which query can read from the read-replica instance.
I wrote a presentation about this: Read Write Splitting in MySQL or the video.
Regarding performance, you may have different secondary indexes on the same tables on each instance. Indexes depend on which queries you will run, so you have to decide first which queries you will run on the source versus the replica.
But consider that if the read-replica goes offline for some reason, or falls too far behind or something, you might want to be able to run the same queries on the source instance until the replica is ready again. In that case, you'd want the same indexes on both, or else your query will run very slowly.
Another possibility is that the source instance dies for some reason. In the cloud, you must always have a plan for resources disappearing on short notice, including databases! It happens. So if the source instance dies, the read-replica could take over as your new source. But if it has different indexes, different tables, different instance size, or whatever, then it may not be prepared to be that substitute.
It's simpler to just make the source and replica the same in as many ways as you can. Less details to track, and you're prepared for failover.
In our software, we share information across installations.
Currently we use staging tables within the same database to facilitate this. We use stored procedures to pull certain data from the live tables into the staging tables, and then dump them. This dump then gets loaded into the staging tables of the target database, and stored procedures merge in the data.
This works fine, but for a few reasons*, I'm considering moving this from staging tables to a separate staging database. I'm just concerned about whether or not this will have any performance implications.
Having very quickly tried this (just as a thought exercise) on a couple of differently configured systems, I've come up with differing results. One (with not much data, and running MySQL 5.6) showed no real difference, possibly even slightly faster. The other (with much more data, running MySQL 5.5) showed it to be about 1.5 times slower.
I'm well aware that there will likely be configuration options that may affect this, I'm no DBA, so any pointers would be much appreciated.
TL;DR
What performance implications might there be in inserting data into tables in a different database (on the same server), compared to within the same database. Will it depend on MySQL version, or configuration settings?
* If you're interested in 'reasons', I can let you know in the comments
Let me start by saying that you cannot conduct tests the way you did.
Databases, and MySQL among them, rely on hardware. There are number of optimizations for MySQL variables available which can turn it from a snail to formula 1.
Consequently, you tested on separate systems, and each either runs on different hardware or contains different data. Technically, what MySQL does by default is use InnoDB storage engine. InnoDB storage engine, by default, stores all the data into a single file. So from some "down to the core" perspective, whether you use a table or a database - MySQL won't really care because it will store it into the one and the same file. From there on we get to the point whether the file is fragmented or not (if on mechanical disk) and to many other interesting details that can't be covered in a single answer.
That brings us to next issue - databases exist to store structured data.
Databases are not glorified text files. They exist so we can ask them to cross-examine data and give us meaningful results to questions.
That means we design databases and tables in ways that correspond to certain logic. If it makes sense to store those staging tables into database A, then store it. If it makes sense to do some other thing - do the other thing.
From performance point of view, it hugely depends on how your servers are configured, both from OS to MySQL variables, to which hardware (especially HDD) you have. Without knowing what's going on there down to the last detail - no one can tell you "Yes, it is faster", "No it is slower" or "It is the same". If they do - they're lying. We basically lack every possible information to tell you with certainty which approach is better.
I've seen pictures like this where multiple rails engines write to a single mySQL server.
1) Is this possible? Or does Rails want each application server to write to one database server?
2) If this is possible, how is it accomplished? Are there queues and a scheduler between the application servers and the write database server?
Scaling a mysql db is a pretty difficult thing to do, but its certainly been done plenty of times and there are a lot of best practices out there for you to take advantage of. The first thing you should know is that before you worry about scaling writes for a while yet, you probably need to scale your reads first.
Scaling reads can be done fairly easily using replication. There are several tools out there that make managing replication a lot easier such as Amazon RDS. Generally speaking many web severs can connect to many databases (as suggested by others), however you quickly run into scale issues once you have a lot of traffic, connections or whatever other action you are performing that generates load on the server.
As replicated severs are read only, you need to manage which sever you connect to depending on the action you're performing. I.e. if you had a users table, when creating, updating or deleting users you need to use the "write" database (the primary "source" sever) but when reading the user table, you can use one of the read replicas. This reduces the load on the primary write sever (allowing it to deal with even more writes) and as you can have multiple read databases behind a load balancer, you can get away with this structure for a very long time and scale reads across tens of database severs before you'll hit any significant issues (however most apps get away with 1-3).
There are situations where you will need to use your write database for read actions (although you should avoid it as much as possible) as the read replicas can be slightly behind the write dbs due to latency in replicating the write db queries, however most of the time you should be able to code knowing that there is the possibility that the read db is delayed (i.e. queue actions a reasonable period of time such that the updates will propagate across all the read severs) and simply use one of your read dbs rather than the write db.
Beyond this the key items to work on are ensuring you have efficient indexes and applying other best practices around maintaining a sensible data structure. You might also want to consider having 3 distinct "groups" of database servers. I generally like to have write, read and "stats" db groups. The write group for create, update and delete operations (as well as select for update), the read for general read items that must return their results quickly, and stats for anything that is going to be under high load and that you do not rely on for a prompt response (this keeps heavy queries that are not time sensitive away from your read db that you need quick responses from for general reads)
Once you get into a situation where you can no longer buy larger hardware and you're near maxing out your write capacity, you'll need to look into sharding, however that will take a lot of traffic / data (so dont worry about it unless you've done all of the above already).
I have been searching for this for a while and unable to find something useful.
Is it a good practice or even important to create 2 MySQL users, one for reading and then use that whenever I'm initiating a MySQL SELECT.
And on the other side, create another user for writing and use it whenever I'm doing an INSERT, UPDATE, DELETE, ...?
Would this help at anything for example if I'm writing and reading to the database at the same time?
Assume we're using InnoDB tables.
"good practice" is very hard to define - you've got a whole bunch of different things to trade off against each other.
I'm assuming that the database is being used as a back-end for some other system, and that your users don't have direct access to a SQL prompt. In that case, there are no real benefits to creating different MySQL users - it simply makes the front-end more complex, and an attacker who can reach the database and knows the "read-only" credentials almost certainly also knows the "read/write" credentials. From a security point of view, you should invest your time in network security of the database server, and secure storage of connection details.
From a concurrency point of view - two or more users reading and writing at the same time - you won't really gain anything either. This particular requirement is one of the things relational databases do very well, and I don't think it's affected at all by the permissions of the users - it's far more to do with whether you're using transactions, and how quickly your SQL executes.
I need to set up a MySQL environment that will support adding many unique databases over time (thousands, actually).
I assume that at some point I will need to start adding MySQL servers, and would like my environment to be prepared for the case beforehand, to make the transition to a 2nd, 3rd, 100th server easy.
And just to make it interesting, It would be very convenient if the solution was modeled so the application that queries the databases sends all the queries to a single address and receives a result. It should be unaware of the number and location of the servers. The database name is unique and can be used to figure out which server holds the database.
I've done some research, and MySQL Proxy pops out as the main candidate, but I haven't been able to find anything specific about making it perform as described above.
Anyone?
Great question. I know of several companies that have done this (Facebook jumps out as the biggest). None are happy, but alternatives kind of suck, too.
More things for you to consider -- what happens when some of these databases or servers fail? What happens when you need to do a cross-database query (and you will, even if you don't think so right now).
Here's the FriendFeed solution: http://bret.appspot.com/entry/how-friendfeed-uses-mysql
It's a bit "back-asswards" since they are basically using MySQL as a glorified key-value store. I am not sure why they don't just cut out the middleman and use something like BerkeleyDB for storing their objects. Connection management, maybe? Seems like the MySQL overhead would be too high a price to pay for something that could be added pretty easily (famous last words).
What you are really looking for (I think) is a distributed share-nothing database. Several have been built on top of open-source technologies like MySQL and PostgreSQL, but none are available for free. If you are in the buying mood, check out these companies: Greenplum, AsterData, Netezza, Vertica.
There is also a large number of various distributed key-value storage solutions out there. For lack of a better reference, here's a starting point: http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/ .
Your problem sounds similar to one we faced - that you are acting as a white-label, and that each client needs to have their own separate database. Assuming this concept parallels yours, what we did was leverage a "master" database that stored the hostname and database name for the client (which could be cached in the application tier). The server the client was accessing could then dynamically shift its datasource to the required database. This allowed us to scale up to thousands of client databases, scattered across servers.