performance effect of joining tables from different databases - mysql

I have a web site using a database named lets say "site1". I am planning to put another site on the same server which will also use some of the tables from "site1".
So should I use three different databases like "site1" (for first site specific data), "site2" (for second site specific data), and "general" (for common tables). In which there will be join statements between databases general and site1 and site2. Or should I put all tables in one database?
Which is the best practice to do?
How performances differ in each situation?
I am using MySQL. So how is the situation especially for MySQL?
Thanks in advance...

From the performance point of view, there won't be ANY difference. Just keep your indexes in place and you will not notice whether you are using single DB or multiple DBs.
Apart from performance, there are 2 small implications that I can think of:
1. You can not have foreign keys across DBs.
2. Partitioning tables in DB based on their usage or based on applications can help you manage permissions in easy way.

I can speak from recent personal experience. I have some old mysql queries in some PHP code that worked fine with a relatively small database, but as it grew the query got slower and slower.
I have freeradius running mysql in its own database along with another management php app that I wrote. The freeradius table is > 1.5 million rows. I was attempting to join tables from my app's database to the freeradius database. I can say for sure 1.5 million rows is too many. Running some queries locked up my app altogether. I ended up having to re-write portions of my php app to do things differently (ie not joining 2 tables from different database). I also indexed the radius accounting table on some key fields and optimized some queries (mysql EXPLAIN statement is wonderful to help with this). Things are MUCH faster now.
I will definitely be hesitant to join 2 tables from different databases in the future unless really really necessary.

Related

What data quantity is considered as too big for MySQL?

I am looking for a free SQL database able to handle my data model. The project is a production database working in a local network not connected to the internet without any replication. The number of application connected at the same times would be less than 10.
The data volume forecast for the next 5 years are:
3 tables of 100 millions rows
2 tables of 500 millions rows
20 tables with less than 10k rows
My first idea was to use MySQL, but I have found around the web several articles saying that MySQL is not designed for big database. But, what is the meaning of big in this case?
Is there someone to tell me if MySQL is able to handle my data model?
I read that Postgres would be a good alternative, but require a lot of hours for tuning to be efficient with big tables.
I don't think so that my project would use NOSQL database.
I would know if someone has some experience to share with regarding MySQL.
UPDATE
The database will be accessed by C# software (max 10 at the same times) and web application (2-3 at the same times),
It is important to mention that only few update will be done on the big tables, only insert query. Delete statements will be only done few times on the 20 small tables.
The big tables are very often used for select statement, but the most often in the way to know if an entry exists, not to return grouped and ordered batch of data.
I work for Percona, a company that provides consulting and other services for MySQL solutions.
For what it's worth, we have worked with many customers who are successful using MySQL with very large databases. Terrabytes of data, tens of thousands of tables, tables with billions of rows, transaction load of tens of thousands of requests per second. You may get some more insight by reading some of our customer case studies.
You describe the number of tables and the number of rows, but nothing about how you will query these tables. Certainly one could query a table of only a few hundred rows in a way that would not scale well. But this can be said of any database, not just MySQL.
Likewise, one could query a table that is terrabytes in size in an efficient way. It all depends on how you need to query it.
You also have to set specific goals for performance. If you want queries to run in milliseconds, that's challenging but doable with high-end hardware. If it's adequate for your queries to run in a couple of seconds, you can be a lot more relaxed about the scalability.
The point is that MySQL is not a constraining factor in these cases, any more than any other choice of database is a constraining factor.
Re your comments.
MySQL has referential integrity checks in its default storage engine, InnoDB. The claim that "MySQL has no integrity checks" is a myth often repeated over the years.
I think you need to stop reading superficial or outdated articles about MySQL, and read some more complete and current documentation.
MySQLPerformanceBlog.com
High Performance MySQL, 3rd edition
MySQL 5.6 manual
MySQL has a two important (and significantly different) database engines - MyISAM and InnoDB. A limits depends on usage - MyISAM is nontransactional - there is relative fast import, but it is too simple (without own memory cache) and JOINs on tables higher than 100MB can be slow (due too simple MySQL planner - hash joins is supported from 5.6). InnoDB is transactional and is very fast on operations based on primary key - but import is slower.
Current versions of MySQL has not good planner as Postgres has (there is progress) - so complex queries are usually much better on PostgreSQL - and really simple queries are better on MySQL.
Complexity of PostgreSQL configuration is myth. It is much more simple than MySQL InnoDB configuration - you have to set only five parameters: max_connection, shared_buffers, work_mem, maintenance_work_mem and effective_cache_size. Almost all is related to available memory for Postgres on server. Usually work for 5 minutes. On my experience a databases to 100GB is usually without any problems on Postgres (probably on MySQL too). There are two important factors - how speed you expect and how much memory and how fast IO you have.
With large databases you have to have a experience and knowledges for any database technology. All is fast when you are in memory, and when ratio database size/memory is higher, then much more work you have to do to get good results.
First of all, MySQLs table size is only limited by the allowed file size limit of your OS which is I. The terra bytes on any modern OS. That would pose no problems. Most important are questions like this:
What kind of queries will you run?
Are the large table records updated frequently or basically archives for history data?
What is your hardware budget?
What is the kind of query speed you need?
Are you familiar with table partitioning, archive tables, config tuning?
How fast do you need to write (expected inserts per second)
What language will you use to connect to the db (Java, .net, Ruby etc)
What platform are you most familiar with?
Will you run queries which might cause table scans such like '%something%' which would have to go through every single row and take forever
MySQL is used by Facebook, google, twitter and others with large tables and 100,000,000 is not much in the age of social media. MySQL has very little drawbacks (even though I prefer postgresql in most cases) like altering large tables by adding a new index for example. That might send your company in a couple days forced vacation if you don't have a replica in the meantime. Is there a reason why NoSQL is not an option? Sometimes hybrid approaches are a good choice like having your relational business logic in MySQL and huge statistical tables in a NoSQL database like MongoDb which can scale by adding new servers in minutes (MySQL can too but it's more complicated). Now MongoDB can have a indexed column which can be searched by in blistering speed.
Bejond the bottom line: you need to answer the above questions first to make a very informed decision. If you have huge tables and only search on indexed keys almost any database will do - if you expect many changes to the structure down the road you want to use a different approach.
Edit:
Based on your update you just posted I doubt you would run into problems.

Run analytics on huge MySQL database

I have a MySQL database with a few (five to be precise) huge tables. It is essentially a star topology based data warehouse. The table sizes range from 700GB (fact table) to 1GB and whole database goes upto 1 terabyte. Now I have been given a task of running analytics on these tables which might even include joins.
A simple analytical query on this database can be "find number of smokers per state and display it in descending order" this requirement could be converted in a simple query like
select state, count(smokingStatus) as smokers
from abc
having smokingstatus='current smoker'
group by state....
This query (and many other of same nature) takes a lot of time to execute on this database, time taken is in order of tens of hours.
This database is also heavily used for insertion which means every few minutes there are thousands of rows getting added.
In such a scenario how can I tackle this querying problem?
I have looked in Cassandra which seemed easy to implement but I am not sure if it is going to be as easy for running analytical queries on the database especially when I have to use "where clause and group by construct"
Have Also looked into Hadoop but I am not sure how can I implement RDBMS type queries. I am not too sure if I want to right away invest in getting at least three machines for name-node, zookeeper and data-nodes!! Above all our company prefers windows based solutions.
I have also thought of pre-computing all the data in a simpler summary tables but that limits my ability to run different kinds of queries.
Are there any other ideas which I can implement?
EDIT
Following is the mysql environment setup
1) master-slave setup
2) master for inserts/updates
3) slave for reads and running stored procedures
4) all tables are innodb with files per table
5) indexes on string as well as int columns.
Pre-calculating values is an option but since requirements for this kind of ad-hoc aggregated values keeps changing.
Looking at this from the position of attempting to make MySQL work better rather than positing an entirely new architectural system:
Firstly, verify what's really happening. EXPLAIN the queries which are causing issues, rather than guessing what's going on.
Having said that, I'm going to guess as to what's going on since I don't have the query plans. I'm guessing that (a) your indexes aren't being used correctly and you're getting a bunch of avoidable table scans, (b) your DB servers are tuned for OLTP, not analytical queries, (c) writing data while reading is causing things to slow down greatly, (d) working with strings just sucks and (e) you've got some inefficient queries with horrible joins (everyone has some of these).
To improve things, I'd investigate the following (in roughly this order):
Check the query plans, make sure the existing indexes are being used correctly - look at the table scans, make sure the queries actually make sense.
Move the analytical queries off the OLTP system - the tunings required for fast inserts and short queries are very different to those for the sorts of queries which potentially read most of a large table. This might mean having another analytic-only slave, with a different config (and possibly table types - I'm not sure what the state of the art with MySQL is right now).
Move the strings out of the fact table - rather than having the smoking status column with string values of (say) 'current smoker', 'recently quit', 'quit 1+ years', 'never smoked', push these values out to another table, and have the integer keys in the fact table (this will help the sizes of the indexes too).
Stop the tables from being updated while the queries are running - if the indexes are moving while the query is running I can't see good things happening. It's (luckily) been a long time since I cared about MySQL replication, so I can't remember if you can batch up the writes to the analytical query slave without too much drama.
If you get to this point without solving the performance issues, then it's time to think about moving off MySQL. I'd look at Infobright first - it's open source/$$ & based on MySQL, so it's probably the easiest to put into your existing system (make sure the data is going to the InfoBright DB, then point your analytical queries to the Infobright server, keep the rest of the system as it is, job done), or if Vertica ever releases its Community Edition. Hadoop+Hive has a lot of moving parts - its pretty cool (and great on the resume), but if it's only going to be used for the analytic portion of you system it may take more care & feeding than other options.
1 TB is not that big. MySQL should be able to handle that. At least simple queries like that shouldn't take hours! Can't be very helpful without knowing the larger context, but I can suggest some questions that you might ask yourself, mostly related to how you use your data:
Is there a way you can separate the reads and writes? How many read so you do per day and how many writes? Can you live with some lag, e.g write to a new table each day and merge it to the existing table at the end of the day?
What are most of your queries like? Are they mostly aggregation queries? Can you do some partial aggregation beforehand? Can you pre-calculate number of new smokers every day?
Can you use hadoop for the aggregation process above? Hadoop is kinda good at that stuff. Basically use hadoop just for daily or batch processing and store the results into the DB.
On the DB side, are you using InnoDB or MyISAM? Are the indices on String columns? Can you make it ints etc.?
Hope that helps
MySQL is have a serious limitation what prevent him to be able to perform good on such scenarious. The problem is a lack of parralel query capability - it can not utilize multiple CPUs in the single query.
Hadoop has an RDMBS like addition called Hive. It is application capable of translate your queries in Hive QL (sql like engine) into the MapReduce jobs. Since it is actually small adition on top of Hadoop it inherits its linear scalability
I would suggest to deploy hive alongside MySQL, replicate daily data to there and run heavy aggregations agains it. It will offload serious part of the load fro MySQL. You still need it for the short interactive queries, usually backed by indexes. You need them since Hive is iherently not-interactive - each query will take at least a few dozens of seconds.
Cassandra is built for the Key-Value type of access and does not have scalable GroupBy capability build-in. There is DataStax's Brisk which integrate Cassandra with Hive/MapReduce but it might be not trivial to map your schema into Cassandra and you still not get flexibility and indexing capabiilties of the RDBMS.
As a bottom line - Hive alongside MySQL should be good solution.

Max tables in a MySQL database

Is it bad to have too many tables in a database? I have about 160 tables in one database. Is it better to split it into several database rather than using a single database? Single database is more convenient for me.
There are no server limits on the number of tables in a MySQL database. You will definitely have no problems with 160 tables, and you don't need to split them into multiple databases.
You will not gain performance by splitting your tables into multiple databases. If performance remains an issue, you could consider using per-table tablespaces in order to place some sets of tables on different physical disks.
according to MySQL reference manual:
MySQL has no limit on the number of tables. The underlying file system may have a limit on the number of files that represent tables. Individual storage engines may impose engine-specific constraints. InnoDB permits up to 4 billion tables.
160 tables isn't radically huge.
16,000 might be...probably would be...more unreasonable - such databases exist in ERP or CRM systems (even into the 40-50K tables range, but many of those tables are not actually used, or are only barely used).
Even so, the typical DBMS will 'handle' such large databases, but there is more strain on the system catalog than usual in such systems.
160 is still ok. it makes SQL command more faster than making too many contents under a single table. In my case I have 8,545,214 tables in a single Mysql database. I dont want to store millions of user in a single database that's why I used multiple table to store each post a user done. it makes mysql more faster than searching on a single table with millions of rows.
WordPress Multisite creates dozens of tables for every new subsite in the same database.
So you can be so good at only 160 tables.
It might be an issue to manage them with PhpMyAdmin or other software to see and scroll through the tables. But if you work with the code it should not be a problem.

Best approach to relating databases or tables?

What I have:
A MySQL database running on Ubuntu that maintains a
large table of articles (similar to
wordpress).
Need to relate a given article to
another set of data. This set of data
will be fairly large.
There maybe various sets of data that
will be related.
The query:
Is it better to contain these various large sets of data within the same database of articles, which will have a large set of traffic on it?
or
Is it better to create different databases (on the same server) that
relate by a primary key to the main database with the articles?
Put them all in the same DB initially, until you find that there is a performance issue. Much easier than prematurely optimising.
Modern RDBMS are very good at optimising data access.
If you need to connect frequently and read both of the records, you should put in a the same database. The server then won't have to run permission checks twice for each of your databases.
If you have serious traffic, you should consider using persistent connection for that query.
If you don't need to read them together frequently, consider to put on different machine. As the high traffic for the bigger database won't cause slow downs on the other.
Different databases on the same server gives you all the problems of a distributed architecture without any of the benefits of scaling out. One database per server is the way to go.
When you say 'same database' and 'different databases related' don't you mean 'same table' vs 'different tables'?
if that's the question, i'd say:
one table for articles
if these 'other sets of data' are all of the same structure, put them all in the same table. if not, one table per kind of data.
everything on the same database
if you grow big enough to make database size a performance issue (after many million records and lots of queries a second), consider table partitioning or maybe replacing the biggest table with a key/value store (couchDB, mongoDB, redis, tokyo cabinet, [etc][6]), which can be a little faster than MySQL but a lot easier to distribute for performance.
[6]:key-value store

MySQL Simple Table Synchronization?

I'm developing a website which to begin with will have three clear sub sites: Forum, News and a Calendar.
Each sub site will have it's own database and common to all of these databases will be a user table which needs to be in each database so that joins can be done.
How can I synchronize all the user tables so that it doesn't matter in which database I make an update, all the databases will have the same user table.
I'm not worried if there is a short sync delay (less than 1min) and I would prefer that the solution was a simple as possible.
Why do the sub-sites need to have their own databases? Can't you just use one database, with separate tables for each of the applications? Or, in PostgreSQL, you could use schemas to the same effect.
Though I would hardly endorse an architecture like this, federated tables may do what you want.
A single app can log into more than one database. While I'd advocate kquinn's answer of "all in one DB", becaue joins will work then, if you really must have separate databases, at least have the user table accessed from one database. "Cloning" a table across multiple databases is fraught with so much peril it's not funny.
I was over complicating the problems/solution.
Since the databases will (for the time being) exist on the same server, I can use a very simple View.