Database separation - MySQL - mysql

I have a main MySQL db set up, and a class to handle the queries to it. It runs real nice. I am building a custom advertising system on my site and I'm wondering if there is any benefit to creating a separate database all together to handle that system?
Is there any pitfalls to doing it either way?
Option #1 - one DB for main website function, one DB for advertising system
Option #2 - one DB for both main website function and advertising system

Well, you need a new connection for every Database you use, also you need a new instance of your DB-Class - both costs some (minimal) memory. I personally see no reason why you would need/want to do this. If you just want to separate the two things, maybe you could use a prefix like "adv_" for the advertisement tables.
Edit: another problem could come up if you ever want to combine (e.g. join) data of the two databases - you will have a much easier time if you do not use multiple databases.

Johnnietheblack, there is no easy answer here, and not even one right answer: different tables need different approaches, and sometimes you have to throw away an academic/more "secure" database model to improve performance & scalability.
It's always a matter of trade-offs. Based on my personal experience, I have some thoughts to share with you:
When you separate tables in different databases, you have more work to do in your data abstraction layers to keep referential integrity (you have to do the DB chores...) and to link information. On the other hand, it's easier to manage the databases (indexes, data files, query tunings, etc.).
Tables with high insert rate and low maintenance (update/delete) and where referential integrity is not that important - like log tables - are good candidates to be put in a separate database: although the I/O from inserts are heavy, the records don't change over time, they are rarely retrieved, and their indexes tend to be pretty simple (date/time and some other attribute). I have one case where the log file was so big (millions of records) that at a point a single insert was taking almost 1 sec. Since it has 500 thousand new records each day, it was a snowball: we cannot stop the system to tune the damn thing because it takes too long, and the system was shutting down because this log table was used everywhere and was impacting the business (75% of the procedures used this table).
Databases can eat THOUSANDS of records for breakfast, so for small tables (less than 1000 records) you generally don't need to worry about, just the big ones ( more than 5000). I have a friend DBA that simply does not create indexes for performance in most of the tables: he made some tests and discovered that their SQL Server was changing the query plan to TABLE SCANS for most of the tables. But be careful here: is strong medicine!
Try to think about SaaS when it comes to define if a new tables set should be put together inside a database: your advertising system needs to be tightly integrated with your website or it can be a separate component, reusable by other components? If it is the later, you should think about using separate databases, to minimize impacts when you update the schema, do maintenance in the new tables, etc.
There are so many other cases, but alas, we have so little time... The important thing here is to keep an open mind and try to forget a little bit about 3rd form academically perfect database models. Hope it helps!

Related

Convert LOTS of identical MySQL tables into ONE and lots of VIEWs that point to it?

I'm running a pretty big deployment of WPMU (Wordpress Multi-User, Wordpress Multisite) that uses 4096 databases and 100k+ tables (with a lot of overlap in what schema is concerned, obviously).
Basically it's the same 20-some tables replicated over and over again for each and every blog, some of them empty, others containing a few to a few hundred rows.
My plan (that saves lots of headaches but may prove inefficient) is to merge all same-schema tables into a few big-ish InnoDB tables and replace the old ones with MySQL VIEWs that point to them, rewriting the queries so that the relevant rows are returned (store the old table name in a new column and then using the view to add the column to the query WHERE clause).
The question is: would this provide ANY kind of improvement in what performance is concerned? (key buffer efficiency, table cache efficiency, indexing) or is this just snake oil and I should resort to a more drastic approach of rewriting the app in such a way that i don't need VIEWs but the queries go straight to the big InnoDB tables?
I would recommend against doing the table merge you're thinking of.
Consider some of the downsides to merging the tables:
Index data structures for the merged tables will be larger and deeper, and therefore less efficient.
Blogs that accumulate a lot of data but then go idle still contribute to the overall size of the tables and indexes, and therefore make queries take longer.
Harder to back up and restore an individual blog.
Harder to move an individual blog to another database server if you want to scale out.
Harder to use SQL privileges to restrict access for a given blog (though you could apply SQL privileges to the views).
Harder to add custom features that include schema changes for a given blog.
Using views or not using views doesn't affect the above issues positively or negatively. In MySQL at least, a view is basically just a query rewrite at runtime, it's not going to use indexes any better or worse than querying the base tables directly.
I once spoke with the database architect for Wordpress.com. They host millions of Wordpress blogs on dozens hundreds of physical servers. In their early days, they started out with the data for all blogs merged into the same tables, but they found the operational difficulties became too great as they grew. Now they host each blog in a separate database.

How can I optimize my database?

I am creating a platform for some clients. Each client needs to have contacts and manage them in groups, categories (which depends of the group) and subcategories (which depends of the category).
The database is going to be very big, and Im afraid about the performance. I want to optimize the database; now, I have these options:
Manage only one database with multiple tables (as we manage now)
Create a database for each client (each database will have the same multiple tables as the option 1)
Manage multiple XML files (like option 2, each client will have a directory with an XML for contacts, another XML file for groups, another for categories, and so on)
Wich is the best option for performance and management of the data (CRUD, create, read, update, delete)??
Thanks!!
I think one database with multiple tables is the way to go, because duplicating the database and schema for each new client doesn't scale well. XML files sounds cool but so far I haven't seen an XML read/write engine which is as fast as most RDBMSes, so bin that one.
To make this work (lots of tables in one database) you should pay attention to indexing and optimizing the one database; indexes in particular will help you maintain speed as you scale up.
Use clustered indexing on the clienId in whichever table it might exist as a foreign key. This procedure will give you the best client-centric performance because you would (usually) be pulling a particular client's info in a page fetch.
For #2, I would suggest making that a premium service to your clients. If they want "priority hosting" on a separate server of "their own" then they pay extra. That will make the maintenance headache worthwhile.
Have you tried actually implementing 1 (which is the easiest)?
Did you profile the code?
What is the performance now?
use EXPLAIN to see how the queries are performing?
Do you use indexes (often correct indexes are enough to give excellent performance changes)?
Optimize when you hit a bottleneck (or when you set certain benchmarks for performance), not during design phase...
UPDATE: You mentioned "millions of entries". That's nothing for mysql (provided you use correct indexes on your tables). I have a table with about 40 million rows & although it's not lightning fast it gives me results in a couple of seconds. So there you go...
3 is not advisable. Search etc. is not what XML files do efficiently.
2 is a maintenance problem.
1 should be doable. "very big" means what? I have a database with a tabe with currently 1.5 billion entries - that is "big" not "very big". What do you define as very big?
As far as ongoing maintenance and support goes I think only option 1 makes sense for you.
Index all columns you need to but nothing more. Look at your code and see how tables are being JOINed and index the columns which will otherwise require a table scan.
Indicies will speed up the read operations but slow down your write operations as you need to update the indicies as well as the column. They also need more space in the DB.
As suggested above use EXPLAIN to see how your queries are executing and what can be optimized there.
Finally performance tuning only works well after you baseline your existing performance, make a change, then baseline performance again to see if it helped. If not roll back and try something else. But always start with a known level of performance, otherwise you might end up making multiple changes which in total slow things down. Good luck!

Private messaging system, large single table versus many small tables

I'm considering a design for a private messaging system and I need some input here, basically I have several questions regarding this. I've read most of the related questions and they've given me some thought already.
All of the basic messaging systems I've thus far looked into use a single table for all of the users' messages. With indexes etc this approach would seem fine.
What I wanted to know is if there would be any benefit to splitting the user messages into separate tables. So when a new user is created a new table is created (either in the same or a dedicated message database) which stores all of the messages - sent and received -for that user.
What are the pitfalls/benefits to approaching things that way?
I'm writing in PHP would the code required to write be particularly more cumbersome than the first large table option?
Would the eventual result, with a large amount of smaller tables be a more robust, trouble free design than one large table?
In the event of large amounts of concurrent users, how would the performance of the server compare where dealing with one large versus many small tables?
Any help with those questions or other input would be appreciated. I'm currently working through a smaller scale design for my test site before rewriting the PM module and would like to optimise it. My poor human brain handles separate table far more easily, but the same isn't necessarily so for a computer.
You'll just get headaches from moving to small numerous tables. Databases are made for handling lots of data, let it do it's thing.
You'll likely end up using dynamic table names in queries (SELECT * FROM $username WHERE ...), making smart features like stored procedures and possibly parameterized queries a lot trickier if not outright impossible. Usually a really bad idea.
Try rewriting SELECT * FROM messages WHERE authorID = 1 ORDER BY date_posted DESC, but where "messages" is anywhere between 1 and 30,000 different tables. Keeping your table relations monogamous will keep them bidirectional, way more useful.
If you think table size will really be a problem, set up an "archived messages" clone table and periodically move old & not-unread messages there where they won't get in the way. Also note how most forum software with private messaging allows for limiting user inbox sizes. There are a few ways to solve the problem while keeping things sane.
I'm agreeing with #MarkR here - in that initially the one table for messages is definitely the way to proceed. As time progresses and should you end up with a very large table then you can consider how to partition the table to best proceed. That's counter to the way I'd normally advise design, but we're talking about one table which is fairly simple - not a huge enterprise system.
A very long time ago (pre availability of SQL databases) I built a system that stored private and public messages, and I can confirm that once you split a message base logical entity into more than one everything¹ becomes a lot more complicated; and I doubt that a user per file is the right approach - the overheads will be massive compared to the benefit.
Avoid auto-increment[2] - and using natural keys is very important to the future scalability. Designing well to ensure that you can insert and retrieve without locking will be of more benefit.
¹ Indexing, threading, searching, purging/archiving.
² Natural keys are better if you can find one for your data as the autoincremented ID does not describe the data at all and databases are good at locating based on the primary key, so a natural primary key can improve things. Autoincrement can cause problems with a distributed database; it also leaks data when presented externally (to see the number of users registered just create a new account and check your user ID). If you can't find a natural key then a UUID (or GUID) may still be a better option - providing that the database has good support for this as a primary key. See When to use an auto-incremented primary key and when not to
Creating one table per user certainly won't scale well when there are a large number of users with a small number of messages. The way MySQL handles table opening/closing, very large numbers of tables (> 10k, say) become quite inefficient, especially at server startup and shutdown, as well as trying to backup non-transactional tables.
However, the way you've worded your question sounds like a case of premature optimisation. Make it work first, then fix performance problems. This is always the right way to do things.
Partitioning / sharding will become necessary once your scale gets high enough. But there are a lot of other things to worry about in the mean time. Sort them out first :)
One table is the right way to go from an RDBMS PoV. I recommend you use it until you know better.
Splitting large amounts of data into smaller sets makes sense if you're trying to avoid locking issues: for example - locking the messages table - doing big selects or updating huge amounts of data at once. In this case long running queries could block whole table and everyone needs to wait... You should ask yourself if this going to happen in your case? At least for me it looks like messaging system is not going to have such things because all information is being pushed into table or retrieved from it in rather small sets. If this is a user centric application - so, for example, getting all messages for single user is quite easy and fast to do, the same goes also for creating new messages for one or another particular user... Unless you would have really huge amounts of users/messages in your system.
Splitting data into multiple tables has also some drawbacks - you will need kind of management system or logic how do you split everything - giving separate table for each user could grow up soon into hundreds or thousands of tables - which is, in my opinion, not that nice. Therefore probably you would need some other criteria how to split the data. If you want splitting logic to be dynamic and easy adjustable - you would probably need also to save it in DB somehow. As you see complexity grows...
As advantage of such data sharding could be the scalability - you could easy put different sets of data on different machines once single machine is not able to handle whole load.
It depends how your message system works.
Are there cuncurrency issue?
Does it need to be scalable as the application accomodate more customers?
Designing one table will perfectly work on small, one message at a time single user system.
However, if you are considering multiple user, concurrent messaging system, the tables should be splited
Data model for Real time application is recommended to be "normalized"(Spliting table) due to "locking & latching" and data redundency issue.
Locking policy varies by Database Vendor. If you have tables that have updates & select by applicaiton concurrently, "Locking"(page level, row level, table level depending on vendor) issue araise. Some bad DB & app design completely lock the table so message never go through.
Redendency issue is more clear. If you use only one table, some information(like user. I guess one user could send multiple messages) is redundent.
Try to google with "normalization", 'Locking"..

Max Tables & Design Pattern

I am working on an app right now which has the potential to grow quite large. The whole application runs through a single domain, with customers being given sub-domains, which means that it all, of course, runs through a common code-base.
What I am struggling with is the database design. I am not sure if it would be better to have a column in each table specifying the customer id, or to create a new set of tables (in the same database), or to create a complete new database per customer.
The nice thing about a "flag" in the database specifying the customer id is that everything is in a single location. The downfalls are obvious- Tables can (will) get huge, and maintenance can become a complete nightmare. If growth occurs, splitting this up over several servers is going to be a huge pain.
The nice thing about creating new tables it is easy to do, and also keeps the tables pretty small. And since customers data doesn't need to interact, there aren't any problems there. But again, maintenance might become an issue (Although I do have a migrations library that will do updates on the fly per customer, so that is no big deal). The other issue is I have no idea how many tables can be in a single database. Does anyone know what the limit is, and what the performance issues would be?
The nice thing about creating a new database per customer, is that when I need to scale, I will be able to, quite nicely. There are several sites that make use of this design (wordpress.com, etc). It has been shown to be effective, but also have some downfalls.
So, basically I am just looking for some advice on which direction I should (could) go.
Single Database Pros
One database to maintain. One database to rule them all, and in the darkness - bind them...
One connection string
Can use Clustering
Separate Database per Customer Pros
Support for customization on per customer basis
Security: No chance of customers seeing each others data
Conclusion
The separate database approach would be valid if you plan to support customer customization. Otherwise, I don't see the security as a big issue - if someone gets the db credentials, do you really think they won't see what other databases are on that server?
Multiple Databases.
Different customers will have different needs, and it will allow you to serve them better.
Furthermore, if a particular customer is hammering the database, you don't want that to negatively affect the site performance for all your other customers. If everything is on one database, you have no damage control mechanism.
The risk of accidentally sharing data between customers is much smaller with separate database. If you'd like to have all data in one place, for example for reporting, set up a reporting database the customers cannot access.
Separate databases allow you to roll out, and test, a bugfix for just one customer.
There is no limit on the amount of tables in MySQL, you can make an insane amount of them. I'd call anything above a hundred tables per database a maintenance nightmare though.
Are you planning to develop a Cloud App?
I think that you don´t need to make tables or data bases by customer. I recommend you to use a more scalable relational database management system. Personally I don´t know the capabilities of MySQL, but i´m pretty sure that it should support distributed data base model in order to handle the load.
creating tables or databases per customer can lead you to a maintenance nightmare.
I have worked with multi-company databases and every table contains customer ids and to access its data we develop views per customer (for reporting purposes)
Good luck,
You can do whatever you want.
If you've got the customer_id in each column, then you've got to write the whole application that way. That's not exactly true as there should be enough to add that column only to some tables, the rest could be done using some simple joins.
If you've got one database per user, there won't be any additional code in the application so that could be easier.
If you take to first approach there won't be a problem to move to many databases as you can have the customer_id column in all those tables. Of course then there will be the same value in this column in each table, but that's not a problem.
Personally I'd take the simple one customer one database approach. Easier to user more database servers for all customers, more difficult to show a customer data that belongs some other customer.

MySQL: Many tables or many databases?

For a project we having a bunch of data that always have the same structure and is not linked together.
There are two approaches to save the data:
Creating a new database for every pool (about 15-25 tables)
Creating all the tables in one database and differ the pools by table names.
Which one is easier and faster to handle for MySQL?
EDIT: I am not interessed in issues of database design, I am just interessed in which of the two possibilities is faster.
EDIT 2: I will try to make it more clear. As said we will have data, where some of the date rarely belongs together in different pools. Putting all the data of one type in one table and linking it with a pool id is not a good idea:
It is hard to backup/delete a specific pool (and we expect that we are running out primary keys after a while (even when use big int))
So the idea is to make a database for every pool or create a lot of tables in one database. 50% of the queries against the database will be simple inserts. 49% will be some simple selects on a primary key.
The question is, what is faster to handle for MySQL? Many tables or many databases?
There should be no significant performance difference between multiple tables in a single database versus multiple tables in separate databases.
In MySQL, databases (standard SQL uses the term "schema" for this) serve chiefly as a namespace for tables. A database has only a few attributes, e.g. the default character set and collation. And that usage of GRANT makes it convenient to control access privileges per database, but that has nothing to do with performance.
You can access tables in any database from a single connection (provided they are managed by the same instance of MySQL Server). You just have to qualify the table name:
SELECT * FROM database17.accounts_table;
This is purely a syntactical difference. It should have no effect on performance.
Regarding storage, you can't organize tables into a file-per-database as #Chris speculates. With the MyISAM storage engine, you always have a file per table. With the InnoDB storage engine, you either have a single set of storage files that amalgamate all tables, or else you have a file per table (this is configured for the whole MySQL server, not per database). In either case, there's no performance advantage or disadvantage to creating the tables in a single database versus many databases.
There aren't many MySQL configuration parameters that work per database. Most parameters that affect server performance are server-wide in scope.
Regarding backups, you can specify a subset of tables as arguments to the mysqldump command. It may be more convenient to back up logical sets of tables per database, without having to name all the tables on the command-line. But it should make no difference to performance, only convenience for you as you enter the backup command.
Why not create a single table to keep track of your pools (with a PoolID and PoolName as you columns, and whatever else you want to track) and then on your 15-25 tables you would add a column on all of them which would be a foreign key back to you pool table so you know which pool that particular record belongs to.
If you don't want to mix the data like that, I would suggest making multiple databases. Creating multiple tables all for the same functionality makes my spider sense tingle.
If you don't want one set of tables with poolID poolname as TheTXI suggested, use separate databases rather than multiple tables that all do the same thing.
That way, you restrict the variation between the accessing of different pools to the initial "use database" statement, you won't have to recode your SELECTs each time, or have dynamic sql.
The other advantages of this approach are:
Easy backup/restore
Easy start/stop of a database instance.
Disadvantages are:
a little bit more admin work, but not much.
I don't know what your application is, but really really think carefully before creating all of the tables in one database. That way madness lies.
Edit: If performance is the only thing that concerns you, you need to measure it. Take a representative set of queries and measure their performance.
Edit 2: The difference in performance for a single query between the many tables/many databases model will be neglible. If you have one database, you can tune the hell out of it. If you have many databases, you can tune the hell out of all of them.
My (our? - can't speak for anyone else) point is that, for well tuned database(s), there will be practically no difference in performance between the three options (poolid in table, multiple tables, multiple databases), so you can pick the option which is easiest for you, in the short AND long term.
For me, the best option is still one database with poolId, as TheTXI suggested, then multiple databases, depending upon your (mostly administration) needs. If you need to know exactly what the difference in performance is between two options, we can't give you that answer. You need to set it up and test it.
With multiple databases, it becomes easy to throw hardware at it to improve performance.
In the situation you describe, experience has led me to believe that you'll find the separate databases to be faster when you have a large number of pools.
There's a really important general principle to observe here, though: Don't think about how fast it'll be, profile it.
I'm not too sure I completely understand your scenario. Do you want to have all the pools using the same tables, but just differing by a distinguishing key? Or do you want separate pools of tables within the one database, with a suffix on each table to distinguish the pools?
Either way though, you should have multiple databases for two major reasons. The first being if you have to change the schema on one pool, it won't affect the others.
The second, if your load goes up (or for any other reason), you may want to move the pools onto separate physical machines with new database servers.
Also, security access to a database server can be more tightly locked down.
All of these things can still be accomplished without requiring separate databases - but the separation will make all of this easier and reduce the complexity of having to mentally track which tables you want to operate on.
Differing the pools by table name or putting them in separate databases is about the same thing. However, if you have lots of tables in one database, MySQL has to load the table information and do a security check on all those tables when logging in/connecting.
As others mentioned, separate databases will allow you to shift things around and create optimizations specific to a certain pool (i.e. compressed tables). It is extra admin overhead, but there is considerably more flexibility.
Additionally, you can always "pool" the tables that are in separate databases by using federated or merge tables to simplify querying if needed.
As for running out of primary keys, you could always use a compound primary key if you are using MyISAM tables. For example, if you have a field called groupCode (any type) and another called sequenceId (auto increment) and create your primary key as groupCode+sequenceId. The sequenceId will increment based on the next unique ID within the group code set.
For example:
AAA 1
AAA 2
BBB 1
AAA 3
CCC 1
AAA 4
BBB 2
...
Although with large tables you have to be careful about caching and make sure the file system you are using handles large files.
I don't know mysql very well, but I think I'll have to give the standard performance answer -- "It depends".
Some thoughts (dealing only with performance/maintenance, not database design):
Creating a new database means a separate file (or files) in the file system. These files could then be put on different filesystems if performance of one needs to be separate from the others, etc.
A new database will probably handle caching differently; eg. All tables in one DB is going to mean a shared cache for the DB, whereas splitting the tables into separate databases means each database can have a separate cache [obviously all databases will share the same physical memory for cache, but there may be a limit per database, etc].
Related to the separate files, this means that if one of your datasets becomes more important than the others, it can easily be pulled off to a new server.
Separating the databases has an added benefit of allowing you to deploy updates one-at-a-time more easily than with the single database.
However, to contrast, having multiple databases means the server will probably be using more memory (since it has multiple caches). I'm sure there are more "cons" for the multi-database approach, but I am drawing a blank now.
So I suppose I would recommend the multi-database approach. Obviously this is only with the understanding that there may very well be a better "database-designy" way of handling whatever you are actually doing.
Given the restrictions you've placed on it, I'd rather spin up more tables in the existing database, rather than having to connect to multiple databases. Managing connection strings TEND to be harder, in addition to managing the different database optimizations you may have.
FTR, in normal circumstances I'd take the approach described by TheTXI.
In answer to your specific question though, I have found it to be dependant on usage. (Cop out I know, but hear me out.)
A single database is probably easier. You'll have to worry about just one connection and would still have to specify tables. Multiple databases could, under certain conditions, be faster though.
If I were you I'd try both. There's no way we'll be able to give you a useful answer.