Best approach to relating databases or tables? - mysql

What I have:
A MySQL database running on Ubuntu that maintains a large table of articles (similar to WordPress).
I need to relate a given article to another set of data. This set of data will be fairly large.
There may be various sets of data that will be related.
The query:
Is it better to contain these various large sets of data within the same database as the articles, which will have a large amount of traffic on it?
or
Is it better to create different databases (on the same server) that relate by a primary key to the main database with the articles?

Put them all in the same DB initially, until you find that there is a performance issue. Much easier than prematurely optimising.
Modern RDBMS are very good at optimising data access.

If you need to connect frequently and read both sets of records, you should put them in the same database. The server then won't have to run permission checks separately for each of your databases.
If you have serious traffic, you should consider using a persistent connection for that query.
If you don't need to read them together frequently, consider putting them on a different machine, so that high traffic on the bigger database won't slow down the other.

Different databases on the same server gives you all the problems of a distributed architecture without any of the benefits of scaling out. One database per server is the way to go.

When you say 'same database' and 'different databases related' don't you mean 'same table' vs 'different tables'?
If that's the question, I'd say:
one table for articles
if these 'other sets of data' all have the same structure, put them all in the same table; if not, one table per kind of data
everything in the same database
If you grow big enough to make database size a performance issue (after many millions of records and lots of queries a second), consider table partitioning or maybe replacing the biggest table with a key-value store (CouchDB, MongoDB, Redis, Tokyo Cabinet, etc.), which can be a little faster than MySQL but a lot easier to distribute for performance.
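A minimal sketch of the "one table per kind of data, everything in the same database" layout. It uses Python's built-in sqlite3 module as a stand-in so it runs without a server; the table and column names (articles, article_stats, article_id, views) are made up for illustration, and real MySQL DDL would differ in details such as AUTO_INCREMENT and the storage engine.

```python
import sqlite3

def build_schema(conn):
    """Create an articles table and one related table keyed by article id."""
    conn.executescript("""
        CREATE TABLE articles (
            id    INTEGER PRIMARY KEY,
            title TEXT NOT NULL
        );
        -- Hypothetical related data set: one table per kind of data.
        CREATE TABLE article_stats (
            id         INTEGER PRIMARY KEY,
            article_id INTEGER NOT NULL REFERENCES articles(id),
            views      INTEGER NOT NULL
        );
        -- Index the foreign key so lookups by article stay cheap as the table grows.
        CREATE INDEX idx_stats_article ON article_stats(article_id);
    """)

def stats_for_article(conn, article_id):
    """Join the related rows back to their article inside one database."""
    rows = conn.execute(
        """SELECT a.title, s.views
           FROM articles a JOIN article_stats s ON s.article_id = a.id
           WHERE a.id = ?
           ORDER BY s.id""",
        (article_id,),
    )
    return rows.fetchall()
```

Because both tables live in the same database, relating them is a plain join on the article's primary key.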

Related

Giant unpartitioned MySQL table issues

I have a MySQL table which is about 8TB in size. As you can imagine, querying is horrendous.
I am thinking about:
Create a new table with partitions
Loop through a series of queries to dump data into those partitions
But the loop will require lots of queries to be submitted & each will be REALLY slow.
Is there a better way to do this? Repartitioning a production database in-situ isn't going to work - this seemed like an OK option, but slow
And is there a tool that will make life easier? Rather than a Python job looping & submitting jobs?
Thanks a lot in advance
You could use pt-online-schema-change. This free tool allows you to partition the table with an ALTER TABLE statement, but it does not block clients from using the table while it's restructuring it.
Another useful tool could be pt-archiver. You would create a new table with your partitioning idea, then pt-archiver to gradually copy or move data from the old table to the new table.
Of course try out using these tools in a test environment on a much smaller table first, so you get some practice using them. Do not try to use them for the first time on your 8TB table.
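To make the "gradually copy" idea concrete, here is a rough sketch of copying rows in small primary-key ranges, one short transaction per batch. This is not how either Percona tool is implemented; it only illustrates the general technique. It uses sqlite3 as a stand-in, and the table layout (an integer id primary key starting above 0, plus a payload column) is an assumption.

```python
import sqlite3

def copy_in_batches(conn, src, dst, batch_size=1000):
    """Copy rows from src to dst in primary-key order, committing after each batch.

    src and dst are trusted table names (they are interpolated into the SQL).
    Assumes ids are positive integers.
    """
    last_id = 0
    copied = 0
    while True:
        # Keyset pagination: seek past the last copied id instead of using OFFSET.
        rows = conn.execute(
            f"SELECT id, payload FROM {src} WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return copied
        with conn:  # commits the batch on success, rolls back on error
            conn.executemany(
                f"INSERT INTO {dst} (id, payload) VALUES (?, ?)", rows
            )
        last_id = rows[-1][0]
        copied += len(rows)
```

Each query touches only one slice of the primary key, so no single statement has to scan the whole table.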
Regardless of what solution you use, you are going to need enough storage space to store the entire dataset twice, plus binary logs. The old table will not shrink, even as you remove data from it. So I hope your filesystem is at least 24TB. Or else the new table should be stored on a different server (or ideally several other servers).
It will also take a long time no matter which solution you use. I expect at least 4 weeks, and perhaps longer if you don't have a very powerful server with direct-attached NVMe storage.
If you use remote storage (like Amazon EBS) it may not finish before you retire from your career!
In my opinion, 8TB for a single table is a problem even if you try partitioning. Partitioning doesn't magically fix performance, and could make some queries worse. Do you have experience with querying partitioned tables? And do you understand how partition pruning works, and when it doesn't work?
Before you choose partitioning as your solution, I suggest you read the whole chapter on partitioning in the MySQL manual: https://dev.mysql.com/doc/refman/8.0/en/partitioning.html, especially the page on limitations: https://dev.mysql.com/doc/refman/8.0/en/partitioning-limitations.html Then try it out with a smaller table.
A better strategy than partitioning for data at this scale is to split the data into shards, and store each shard on one of multiple database servers. You need a strategy for adding more shards as I assume the data will continue to grow.
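As a rough illustration of sharding, the sketch below routes each key to one of several servers with a stable hash. The host names are hypothetical, and hash-modulo routing is only one possible scheme, chosen here because it is the simplest to show.

```python
import zlib

# Hypothetical shard host names.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(key, shards):
    """Pick a shard for a key using CRC32, which is stable across processes
    (unlike Python's built-in hash(), which is salted per run for strings)."""
    index = zlib.crc32(str(key).encode("utf-8")) % len(shards)
    return shards[index]
```

Note that with plain modulo routing, changing the number of shards remaps most keys, so a scheme that has to grow usually needs something more, such as a lookup table from key range to shard or consistent hashing.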

What data quantity is considered as too big for MySQL?

I am looking for a free SQL database able to handle my data model. The project is a production database working on a local network that is not connected to the internet, without any replication. The number of applications connected at the same time would be less than 10.
The data volume forecast for the next 5 years is:
3 tables of 100 million rows
2 tables of 500 million rows
20 tables with less than 10k rows
My first idea was to use MySQL, but I have found several articles around the web saying that MySQL is not designed for big databases. But what does "big" mean in this case?
Can someone tell me whether MySQL is able to handle my data model?
I read that Postgres would be a good alternative, but that it requires many hours of tuning to be efficient with big tables.
I don't think my project would use a NoSQL database.
I would like to know if someone has experience with MySQL to share.
UPDATE
The database will be accessed by C# software (max 10 at the same time) and a web application (2-3 at the same time).
It is important to mention that only a few updates will be done on the big tables; they will mostly see insert queries. Delete statements will be run only a few times, on the 20 small tables.
The big tables are queried very often with select statements, but mostly to check whether an entry exists, not to return grouped and ordered batches of data.
I work for Percona, a company that provides consulting and other services for MySQL solutions.
For what it's worth, we have worked with many customers who are successful using MySQL with very large databases. Terabytes of data, tens of thousands of tables, tables with billions of rows, transaction load of tens of thousands of requests per second. You may get some more insight by reading some of our customer case studies.
You describe the number of tables and the number of rows, but nothing about how you will query these tables. Certainly one could query a table of only a few hundred rows in a way that would not scale well. But this can be said of any database, not just MySQL.
Likewise, one could query a table that is terabytes in size in an efficient way. It all depends on how you need to query it.
You also have to set specific goals for performance. If you want queries to run in milliseconds, that's challenging but doable with high-end hardware. If it's adequate for your queries to run in a couple of seconds, you can be a lot more relaxed about the scalability.
The point is that MySQL is not a constraining factor in these cases, any more than any other choice of database is a constraining factor.
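For the "does this entry exist?" access pattern mentioned in the question, a query that scales well looks roughly like the sketch below. It runs against sqlite3 as a stand-in, and the table and column names (measurements, serial) are invented for the example; the point is only that the lookup goes through a unique index and stops at the first match.

```python
import sqlite3

def create_table(conn):
    # The UNIQUE constraint creates the index that the existence check relies on.
    conn.execute(
        "CREATE TABLE measurements (id INTEGER PRIMARY KEY, serial TEXT NOT NULL UNIQUE)"
    )

def entry_exists(conn, serial):
    """Return True if a row with this key exists, without fetching the row's data."""
    row = conn.execute(
        "SELECT 1 FROM measurements WHERE serial = ? LIMIT 1", (serial,)
    ).fetchone()
    return row is not None
```

An indexed lookup like this costs roughly the same whether the table has a thousand rows or several hundred million, as long as the index is used.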
Re your comments.
MySQL has referential integrity checks in its default storage engine, InnoDB. The claim that "MySQL has no integrity checks" is a myth often repeated over the years.
I think you need to stop reading superficial or outdated articles about MySQL, and read some more complete and current documentation.
MySQLPerformanceBlog.com
High Performance MySQL, 3rd edition
MySQL 5.6 manual
MySQL has two important (and significantly different) storage engines, MyISAM and InnoDB. The limits depend on usage. MyISAM is non-transactional: imports are relatively fast, but it is very simple (it has no memory cache of its own), and JOINs on tables larger than 100MB can be slow (due to the overly simple MySQL planner; hash joins were only added in MySQL 8.0.18). InnoDB is transactional and very fast for operations based on the primary key, but imports are slower.
Current versions of MySQL do not have as good a planner as Postgres has (although there is progress), so complex queries usually run much better on PostgreSQL, and really simple queries run better on MySQL.
The complexity of PostgreSQL configuration is a myth. It is much simpler than MySQL InnoDB configuration: you have to set only five parameters: max_connections, shared_buffers, work_mem, maintenance_work_mem and effective_cache_size. Almost all of them relate to the memory available to Postgres on the server. It is usually five minutes of work. In my experience, databases up to 100GB usually run without any problems on Postgres (and probably on MySQL too). There are two important factors: how much speed you expect, and how much memory and how fast an IO system you have.
With large databases you need experience and knowledge for any database technology. Everything is fast when the data fits in memory; the higher the ratio of database size to memory, the more work you have to do to get good results.
First of all, MySQL's table size is limited only by the maximum file size your OS allows, which is in the terabytes on any modern OS. That would pose no problems. The most important questions are these:
What kind of queries will you run?
Are the large table records updated frequently or basically archives for history data?
What is your hardware budget?
What is the kind of query speed you need?
Are you familiar with table partitioning, archive tables, config tuning?
How fast do you need to write (expected inserts per second)?
What language will you use to connect to the db (Java, .NET, Ruby etc.)?
What platform are you most familiar with?
Will you run queries which might cause table scans, such as LIKE '%something%', which would have to go through every single row and take forever?
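To see the difference between a leading-wildcard search and an indexed lookup, you can ask the database for its query plan. The sketch below does this with SQLite's EXPLAIN QUERY PLAN as a stand-in (in MySQL you would use EXPLAIN and read its output instead); the users table and its name column are made up for the example, and the exact plan text varies between SQLite versions.

```python
import sqlite3

def query_plan(conn, sql):
    """Return SQLite's plan text for a statement (the last column of each plan row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_users_name ON users(name)")

# Leading wildcard: every row (or every index entry) has to be examined.
print(query_plan(conn, "SELECT id FROM users WHERE name LIKE '%smith%'"))
# Exact match: the planner can seek directly into the index.
print(query_plan(conn, "SELECT id FROM users WHERE name = 'smith'"))
```

The first plan reports a scan and the second a search on idx_users_name, which is the distinction that matters once the table has hundreds of millions of rows.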
MySQL is used by Facebook, Google, Twitter and others with large tables, and 100,000,000 rows is not much in the age of social media. MySQL has very few drawbacks (even though I prefer PostgreSQL in most cases), such as altering large tables, for example by adding a new index. That might send your company on a couple of days of forced vacation if you don't have a replica to use in the meantime. Is there a reason why NoSQL is not an option? Sometimes hybrid approaches are a good choice, like having your relational business logic in MySQL and huge statistical tables in a NoSQL database like MongoDB, which can scale by adding new servers in minutes (MySQL can too, but it's more complicated). MongoDB can now have an indexed column which can be searched at blistering speed.
Bottom line: you need to answer the above questions first to make a well-informed decision. If you have huge tables and only search on indexed keys, almost any database will do; if you expect many changes to the structure down the road, you may want to use a different approach.
Edit:
Based on your update you just posted I doubt you would run into problems.

Side effect of large number of MySQL tables in a database

Is it OK to keep 10000+ tables in a MySQL database?
I'm making a messaging/chat script, so I'm thinking about partitioning the data over several tables, as it will be a huge amount of data after some days.
IS IT OK?
Or does it have some side effects?
Well, as a table can hold millions of rows, I was thinking maybe a database can hold a large number of tables too.
Or, the question could be: how does Facebook store its huge amount of daily chat messages?
I'm a newbie in MySQL, please help.
MySQL has no limit on the number of tables. The underlying file system may have a limit on the number of files that represent tables. Individual storage engines may impose engine-specific constraints. InnoDB permits up to 4 billion tables.
Even so, the typical DBMS will 'handle' such large databases, but there is more strain on the system catalog than usual in such systems.
I have about huge tables in one database with no ill effects, other than displaying the table list in phpMyAdmin taking a while
It's possible, but I would avoid it unless you have a really good use case for it. It raises all kinds of scalability and maintainability issues. Your table size is mainly limited by available disk space.
If you really need to do it...
You'll need to increase the maximum number of file descriptors that your OS will allow to have open, since MyISAM tables use two file descriptors per table. (If you're using Linux then read the section about ulimit in the man page for bash for how to do this).
Also, there's a MySQL config value called table_cache (renamed table_open_cache in MySQL 5.1.3) that limits how many tables can be held open at once. You'll need to make sure that's large enough to support the number of tables you need.
You won't want to use the standard "flush tables" anymore (unless you're the kind of person that likes to watch paint dry) so you'll need to flush each table individually (e.g. before shutdown).
Again, I would avoid using so many tables. You're probably better off making your schema support what you need in a handful of tables, and consider archiving, warehousing (or deleting!) old data if you're concerned about storing too much data.
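A sketch of what "a handful of tables" could look like for chat messages: one messages table with a composite index instead of one table per conversation, plus a purge step for old data. It uses sqlite3 as a stand-in, and all table and column names are hypothetical.

```python
import sqlite3

def create_messages_table(conn):
    conn.executescript("""
        CREATE TABLE messages (
            id              INTEGER PRIMARY KEY,
            conversation_id INTEGER NOT NULL,
            sender_id       INTEGER NOT NULL,
            sent_at         INTEGER NOT NULL,  -- Unix timestamp
            body            TEXT NOT NULL
        );
        -- One composite index serves "latest messages in conversation X".
        CREATE INDEX idx_messages_conv ON messages(conversation_id, sent_at);
    """)

def recent_messages(conn, conversation_id, limit=50):
    """Newest messages first for a single conversation."""
    rows = conn.execute(
        """SELECT sender_id, body FROM messages
           WHERE conversation_id = ?
           ORDER BY sent_at DESC LIMIT ?""",
        (conversation_id, limit),
    )
    return rows.fetchall()

def purge_older_than(conn, cutoff):
    """Delete messages older than cutoff (archive them elsewhere first if needed)."""
    with conn:
        cur = conn.execute("DELETE FROM messages WHERE sent_at < ?", (cutoff,))
    return cur.rowcount
```

The index does the job that splitting into thousands of tables was meant to do: each conversation's rows are found without touching the others.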

Performing Heavy Crunching On a Table Without Affecting the Table

I'm looking for some general advice on the best way to perform heavy crunching/data-mining on a database table, without affecting the performance of regular site queries on the table. Some of the calculations may involve joining several tables, and involve complex sorting and ordering. So "use better indexes" isn't always the solution.
This question isn't really specific. I'm looking for a general way to solve a problem that's come up many times over the years. So I don't have a specific table schema to show, a specific query to show. I've considered dumping the table first using mysqldump, and then re-importing the table under a different name, and then performing my heavy crunching on that temp table. My sysadmin hates the idea, so I'm looking for any other solutions people have come up with to deal with this type of problem.
If your "heavy crunching" is all read only and you are not doing anything that needs to be written back into your production data, use a Master/Slave replication and use the Slave for all your reporting and data analysis needs. The replication link will keep the values up to date on the Slave, and you can hit the Slave with as much load as you want without slowing down the Master which is serving your production system.
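On the application side, using a replica for reporting mostly comes down to choosing a different connection target for read-only statements. A rough sketch, with hypothetical host names and a deliberately naive rule for deciding what counts as a read:

```python
class ConnectionRouter:
    """Send writes to the primary and read-only statements to a replica."""

    READ_PREFIXES = ("select", "show", "explain")

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        self._next = 0  # round-robin position

    def target_for(self, sql):
        words = sql.split()
        first_word = words[0].lower() if words else ""
        if first_word in self.READ_PREFIXES and self.replicas:
            replica = self.replicas[self._next % len(self.replicas)]
            self._next += 1
            return replica
        # Anything else (or no replica configured) goes to the primary.
        return self.primary


router = ConnectionRouter("primary-db", ["replica-1"])
print(router.target_for("SELECT COUNT(*) FROM orders"))  # replica-1
print(router.target_for("UPDATE orders SET paid = 1"))   # primary-db
```

Keep in mind that replication is normally asynchronous, so a replica can be slightly behind the master; that is usually fine for reporting but not for reads that must see a write made a moment ago.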
If you want to avoid affecting performance of your production database, the only solution I have used previously is to run your queries on another database server.
I would take a backup of the entire database and then restore it on a separate server.
Obviously, you cannot do this if you want to analyze real-time data. But for most analysis, a snapshot from the previous day is sufficient.

Hadoop (+HBase/HDFS) vs Mysql (or Postgres) - Loads of independent, structured data to be processed and queried

Hi there at SO,
I would like some ideas/comments on the following from you honorable and venerable bunch.
I have a 100M records which I need to process. I have 5 nodes (in a rocks cluster) to do this. The data is very structured and falls nicely in the relational data model. I want to do things in parallel since my processing takes some time.
As I see it I have two main options:
Install MySQL on each node and put 20M records on each. Use the head node to delegate queries to the nodes and aggregate the results. Query capabilities++, but I might risk some headaches when I come to choose partitioning strategies etc. (Q. Is this what they call a MySQL/Postgres cluster?). The really bad part is that the processing of the records is now left up to me to take care of (how to distribute across machines etc.)...
Alternatively, install Hadoop, Hive and HBase (note that this might not be the most efficient way to store my data, since HBase is column oriented) and just define the nodes. We write everything in the MapReduce paradigm and, bang, we live happily ever after. The problem here is that we lose the "real time" query capabilities (I know you can use Hive, but that is not suggested for real-time queries, which I need), since I also have some normal SQL queries to execute at times, e.g. "select * from wine where colour = 'brown'".
Note that in theory, if I had 100M machines I could do the whole thing instantly, since the processing of each record is independent of the others. Also, my data is read-only. I do not envisage any updates happening. I do not need/want 100M records on one node. I do not want there to be redundant data (since there is lots of it), so keeping it in BOTH MySQL/Postgres and Hadoop/HBase/HDFS is not a real option.
Many Thanks
Can you prove that MySQL is the bottleneck? 100M records is not that many, and it looks like you're not performing complex queries. Without knowing exactly what kind of processing you need, here is what I would do, in this order:
1. Keep the 100M in MySQL. Take a look at Cloudera's Sqoop utility to import records from the database and process them in Hadoop.
2. If MySQL is the bottleneck in (1), consider setting up slave replication, which will let you parallelize reads without the complexity of a sharded database. Since you've already stated that you don't need to write back to the database, this should be a viable solution. You can replicate your data to as many servers as needed.
3. If you are running complex select queries from the database, and (2) is still not viable, then consider using Sqoop to import your records and do whatever query transformations you require in Hadoop.
In your situation, I would resist the temptation to jump off of MySQL, unless it is absolutely necessary.
There are a few questions to ask, before suggesting.
Can you formulate your queries to access by primary key only? In other words, can you avoid all joins and table scans? If so, HBase is an option if you need a very high rate of read/write access.
I do not think that Hive is a good option, taking into consideration the low data volume. If you expect the data to grow significantly, you can consider it. In any case, Hive is good for analytical workloads, not for OLTP-type processing.
If you do need a relational model with joins and scans, I think a good solution might be one master node and 4 slaves, with replication between them. You will direct all writes to the master and balance reads across the whole cluster. It is especially good if you have many more reads than writes.
In this scheme you will have all 100M records (not that much) on each node. Within each node you can employ partitioning if appropriate.
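Since every node holds the full data set in this setup and each record is processed independently, the head node can hand each node a disjoint primary-key range and add up the partial results. A rough sketch of that split-and-aggregate step follows; the node names are hypothetical, process_range is a placeholder for the real per-node query and processing, and threads stand in for the remote calls.

```python
from concurrent.futures import ThreadPoolExecutor

def key_ranges(min_id, max_id, parts):
    """Split the inclusive range [min_id, max_id] into `parts` contiguous ranges.

    Assumes parts is no larger than the number of ids in the range.
    """
    total = max_id - min_id + 1
    size, extra = divmod(total, parts)
    ranges, start = [], min_id
    for i in range(parts):
        end = start + size + (1 if i < extra else 0) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

def process_range(node, id_range):
    """Placeholder for the per-node work; here it just counts the ids in the range."""
    start, end = id_range
    return end - start + 1

def run_on_nodes(nodes, min_id, max_id):
    """Give each node one key range and aggregate the partial results."""
    ranges = key_ranges(min_id, max_id, len(nodes))
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = pool.map(process_range, nodes, ranges)
        return sum(partials)
```

With 5 nodes and 100M ids, each node ends up with a 20M-row slice, which matches the split described in the question without having to physically shard the data.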
You may also want to consider using Cassandra. I recently discovered this article on HBase vs. Cassandra which I was reminded of when I read your post.
The gist of it is that Cassandra is a highly scalable NoSQL solution with fast querying, which sort of sounds like the solution you're looking for.
So, it all depends on whether you need to maintain your relational model or not.
Hi,
I had a situation where I had many tables which I created in parallel using SQLAlchemy and the Python multiprocessing library. I had multiple files, one per table, and loaded them using parallel COPY processes. If each process corresponds to a separate table, that works well. With one table, using COPY would be difficult. You could use table partitioning in PostgreSQL, I guess. If you are interested I can give more details.
Regards.