MySQL Performance of one vs. many tables

I know that MySQL usually handles tables with many rows well. However, I currently face a setting where one table will be read and written by multiple users (around 10) at the same time and it is quite possible that the table will contain 10 billion rows.
My setting is a MySQL database with an InnoDB storage engine.
I have heard of some projects where tables of that size would become less efficient and slower, also concerning indexes.
I do not like the idea of having multiple tables with exactly the same structure just to split rows. Main question: however, would splitting the rows across such tables actually solve the problem of reduced performance caused by such a large number of rows?
Additional question: What else could I do to work with such a large table? The number of rows itself cannot be reduced.

I have heard of some projects where tables of that size would become less efficient and slower, also concerning indexes.
This is not typical. So long as your tables are appropriately indexed for the way you're using them, performance should remain reasonable even for extremely large tables.
(There is a very slight drop in index performance as the depth of a BTREE index increases, but this effect is practically negligible. Also, it can be mitigated by using smaller keys in your indexes, as this minimizes the depth of the tree.)
In some situations, a more appropriate solution may be partitioning your table. This internally divides your data into multiple tables, but exposes them as a single table which can be queried normally. However, partitioning places some specific requirements on how your table is indexed, and does not inherently improve query performance. It's mainly useful to allow large quantities of older data to be deleted from a table at once, by dropping older partitions from a table that's partitioned by date.
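For illustration, here is a rough sketch of date-based range partitioning (the access_log table and its columns are invented for this example); dropping a whole partition is a quick metadata operation rather than billions of individual row deletes:

    -- Hypothetical log table partitioned by year; names are illustrative only.
    CREATE TABLE access_log (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        user_id    INT UNSIGNED    NOT NULL,
        created_at DATETIME        NOT NULL,
        payload    VARCHAR(255),
        -- the partitioning column must be part of every unique key, including the PK
        PRIMARY KEY (id, created_at)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (YEAR(created_at)) (
        PARTITION p2022 VALUES LESS THAN (2023),
        PARTITION p2023 VALUES LESS THAN (2024),
        PARTITION p2024 VALUES LESS THAN (2025),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

    -- Purging a year of old data touches partition metadata, not individual rows.
    ALTER TABLE access_log DROP PARTITION p2022;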

Related

MySQL partitioning vs indexing performance

In MySQL InnoDB, is there a performance advantage to partitioning the table compared to simply using an index?
Common considerations:
Is an Index the Best Solution?
An index isn’t always the right tool. At a high level, keep in mind that indexes are most effective when they help the storage engine find rows without adding more work than they avoid. For very small tables, it is often more effective to simply read all the rows in the table. For medium to large tables, indexes can be very effective. For enormous tables, the overhead of indexing, as well as the work required to actually use the indexes, can start to add up. In such cases you might need to choose a technique that identifies groups of rows that are interesting to the query, instead of individual rows. You can use partitioning for this purpose.
If you have lots of tables, it can also make sense to create a metadata table to store some characteristics of interest for your queries. For example, if you execute queries that perform aggregations over rows in a multitenant application whose data is partitioned into many tables, you can record which users of the system are actually stored in each table, thus letting you simply ignore tables that don’t have information about those users. These tactics are usually useful only at extremely large scales. In fact, this is a crude approximation of what Infobright does: at the scale of terabytes, locating individual rows doesn’t make sense, so indexes are replaced by per-block metadata.
One thing is sure: you can’t scan the whole table every time you want to query it, because it’s too big. And you don’t want to use an index because of the maintenance cost and space consumption. Depending on the index, you could get a lot of fragmentation and poorly clustered data, which would cause death by a thousand cuts through random I/O. You can sometimes work around this for one or two indexes, but rarely for more. Only two workable options remain: your query must be a sequential scan over a portion of the table, or the desired portion of the table and index must fit entirely in memory.
It’s worth restating this: at very large sizes, B-Tree indexes don’t work. Unless the index covers the query completely, the server needs to look up the full rows in the table, and that causes random I/O a row at a time over a very large space, which will just kill query response times. The cost of maintaining the index (disk space, I/O operations) is also very high. Systems such as Infobright acknowledge this and throw B-Tree indexes out entirely, opting for something coarser-grained but less costly at scale, such as per-block metadata over large blocks of data.
This is what partitioning can accomplish, too. The key is to think about partitioning as a crude form of indexing that has very low overhead and gets you in the neighborhood of the data you want. From there, you can either scan the neighborhood sequentially, or fit the neighborhood in memory and index it. Partitioning has low overhead because there is no data structure that points to rows and must be updated: partitioning doesn’t identify data at the precision of rows, and has no data structure to speak of. Instead, it has an equation that says which partitions can contain which categories of rows.
(many thanks to the excellent book High Performance MySQL)
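To see that a partitioned query really only touches its "neighborhood", check the partitions column of the EXPLAIN output (shown by default in MySQL 5.7+; older versions use EXPLAIN PARTITIONS). Using the hypothetical access_log sketch from earlier:

    -- The optimizer should prune this query to the single partition
    -- that can contain 2024 rows; the EXPLAIN "partitions" column
    -- is expected to list only p2024.
    EXPLAIN
    SELECT COUNT(*)
    FROM access_log
    WHERE created_at >= '2024-01-01'
      AND created_at <  '2025-01-01';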
99% of cases I have looked at do not benefit from PARTITIONing as much as from INDEXing.
My Rules of Thumb for using Partitioning are in http://mysql.rjweb.org/doc.php/partitionmaint . Also, that lists the only 4 use cases where partitioning improves performance.
OK, I can't say "exactly" 99%, but it is very close to that. I do believe strongly in the "4" -- I have been searching since partitioning was added to MySQL many years ago.
For Data Warehousing, the usual performance solution is to create and maintain "Summary tables". This works nicely for 'most' DW applications.
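A hedged sketch of such a summary table (the sales tables and columns are invented): aggregate once per day and let reports read the small rollup instead of the huge fact table.

    -- Hypothetical daily rollup of a large sales fact table.
    CREATE TABLE daily_sales_summary (
        sale_date    DATE          NOT NULL,
        product_id   INT UNSIGNED  NOT NULL,
        order_count  INT UNSIGNED  NOT NULL,
        total_amount DECIMAL(14,2) NOT NULL,
        PRIMARY KEY (sale_date, product_id)
    ) ENGINE=InnoDB;

    -- Refresh yesterday's bucket; rerunning is safe thanks to the upsert.
    INSERT INTO daily_sales_summary (sale_date, product_id, order_count, total_amount)
    SELECT DATE(created_at), product_id, COUNT(*), SUM(amount)
    FROM sales
    WHERE created_at >= CURRENT_DATE - INTERVAL 1 DAY
      AND created_at <  CURRENT_DATE
    GROUP BY DATE(created_at), product_id
    ON DUPLICATE KEY UPDATE
        order_count  = VALUES(order_count),
        total_amount = VALUES(total_amount);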
"Very large BTrees don't work"? Bull. A million-row index will have a BTree depth of about 3. A trillion rows -- about 6. Where's the "won't work"? A "point query" on a trillion row table will touch twice as many nodes in the BTree, and more of them are unlikely to be cached. But it "will work".
Infobright, with its "columnar storage", has its niche. TokuDB, with its "fractal indexing", has its niche. Neither one can say "we are better than BTrees most of the time". (Both those engines get part of their speed by compression.)
Bottom Line: Use an index. Probably a "composite" index. (More indexing tips: http://mysql.rjweb.org/doc.php/index_cookbook_mysql )
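As a minimal example of that composite-index advice (column names borrowed from the hypothetical access_log sketch above), put the equality-tested column first and the range-tested column second:

    -- Hypothetical access pattern: fetch one user's recent rows.
    ALTER TABLE access_log
        ADD INDEX idx_user_created (user_id, created_at);

    -- The query can now seek straight to the matching (user_id, created_at)
    -- range instead of scanning billions of rows.
    SELECT id, payload
    FROM access_log
    WHERE user_id = 1234
      AND created_at >= NOW() - INTERVAL 7 DAY;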

MySQL performance: many rows and columns (MyISAM)

Since I'm still in the beginning of my site design I figured now's a good time to ask this.
I know that one of the ways to optimize MySQL queries is to split your rows into separate tables; however, that does have a few comfort issues.
What I'm considering is this: would querying a table consisting of around 1,000,000 rows and 150 columns, using excellently designed indexes and getting only the needed columns from each query, result in a much higher server load than splitting the table into multiple ones with fewer columns each?
Big blob tables are an anti-pattern; never use them.
Normalized tables will run much much faster than a single blob.
InnoDB is optimized for many small tables that need to be joined.
Besides speed, using normalized tables will save you many headaches:
Your data will be smaller, so more of it fits in memory.
You only store data in one place, so it cannot end up with inconsistent data.
MySQL only allows you to use one index per select per table, multiple tables means you get to use more indexes and get more speed.
Triggers on tables execute much faster.
Normalized tables are easier to maintain.
You have fewer indexes per table, so inserts are faster.
Indexes are smaller (fewer rows) and narrower (fewer columns) and will run much faster as a result.
If the data is static, you can pack the tables for greater efficiency (see the relevant page in the reference manual).
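As an illustrative sketch only (the member tables and columns are invented), a wide 150-column row might split into a small hot table plus satellite tables joined on the primary key:

    -- Core table stays narrow and cache-friendly.
    CREATE TABLE member (
        member_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        username  VARCHAR(30)  NOT NULL,
        email     VARCHAR(100) NOT NULL,
        UNIQUE KEY uk_username (username)
    ) ENGINE=InnoDB;

    -- Bulky, rarely-read columns move to a satellite table.
    CREATE TABLE member_profile (
        member_id INT UNSIGNED NOT NULL PRIMARY KEY,
        bio       TEXT,
        location  VARCHAR(100),
        FOREIGN KEY (member_id) REFERENCES member (member_id)
    ) ENGINE=InnoDB;

    -- Queries join in the bulky columns only when they are actually needed.
    SELECT m.username, p.location
    FROM member m
    JOIN member_profile p ON p.member_id = m.member_id
    WHERE m.username = 'alice';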

In MySQL, does quantity of data in one table affect performance

Setting up a new database that has a comments table. I've been told to expect this table to get extremely large. I'm wondering if there is any particular reason why I wouldn't want to keep this table in the same database as the rest of the data for the site.
Quantity of data will certainly affect performance in any RDBMS, however, there's no reason that this table should exist in a separate database on the same server. If the table truly will become very large with a lot of insert activity, then you might want to consider an ETL process that keeps an indexed copy for fast selection (mostly because indexes can negatively affect inserts despite the performance gain on selects)
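One hedged sketch of that "indexed copy" idea (table and column names invented): keep the write-heavy comments table lean, and periodically append new rows to a heavily indexed reporting copy.

    -- Reporting copy carrying the extra indexes the hot table avoids.
    CREATE TABLE comments_report LIKE comments;
    ALTER TABLE comments_report
        ADD INDEX idx_author (author_id),
        ADD INDEX idx_posted (posted_at);

    -- Periodic refresh: copy only rows added since the last run.
    SET @last_copied := (SELECT COALESCE(MAX(id), 0) FROM comments_report);

    INSERT INTO comments_report
    SELECT * FROM comments
    WHERE id > @last_copied;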
Keep the table in the same database for now, but it is well known that insert speed slows as the number of records increases.
There are some options to consider if the performance becomes unacceptable, for instance, if the data can be partitioned, split it across multiple tables in separate databases.

What is the optimal amount of data for a table?

How much data should be in a table so that reading is optimal? Assuming that I have 3 fields varchar(25). This is in MySQL.
I would suggest that you consider the following in optimizing your database design:
Consider what you want to accomplish with the database. Will you be performing a lot of inserts to a single table at very high rates? Or will you be performing reporting and analytical functions with the data?
Once you've determined the purpose of the database, define what data you need to store to perform whatever functions are necessary.
Normalize till it hurts. If you're performing transaction processing (the most common function for a database) then you'll want a highly normalized database structure. If you're performing analytical functions, then you'll want a more denormalized structure that doesn't have to rely on joins to generate report results.
Typically, if you've really normalized the structure till it hurts then you need to take your normalization back a step or two to have a data structure that will be both normalized and functional.
A normalized database is mostly pointless if you fail to use keys. Make certain that each table has a primary key defined. Don't use surrogate keys just because it's what you always see. Consider what natural keys might exist in any given table. Once you are certain that you have the right primary key for each table, you need to define your foreign key references. Establishing explicit foreign key relationships rather than relying on implicit definition will give you a performance boost, provide integrity for your data, and self-document the database structure.
Look for other indexes that exist within your tables. Do you have a column or set of columns that you will search against frequently like a username and password field? Indexes can be on a single column or multiple columns so think about how you'll be querying for data and create indexes as necessary for values you'll query against.
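A compact sketch of that keys-and-indexes advice (this schema is invented purely for illustration):

    CREATE TABLE account (
        account_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        username   VARCHAR(30)  NOT NULL,
        email      VARCHAR(100) NOT NULL,
        UNIQUE KEY uk_username (username),   -- natural key enforced explicitly
        KEY idx_email (email)                -- frequently searched column
    ) ENGINE=InnoDB;

    CREATE TABLE login_event (
        event_id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        account_id INT UNSIGNED NOT NULL,
        logged_at  DATETIME NOT NULL,
        KEY idx_account_time (account_id, logged_at),
        CONSTRAINT fk_login_account
            FOREIGN KEY (account_id) REFERENCES account (account_id)
    ) ENGINE=InnoDB;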
Number of rows should not matter. Make sure the fields you're searching on are indexed properly. If you only have 3 varchar(25) fields, then you probably need to add a primary key that is not a varchar.
Agree that you should ensure that your data is properly indexed.
Apart from that, if you are worried about table size, you can always implement some type of data archival strategy later down the line.
Don't worry too much about this until you see problems cropping up, and don't optimise prematurely.
For optimal reading you should have an index. A table exists to hold the rows it was designed to contain. As the number of rows increases, the value of the index comes into play and reading remains brisk.
Phrased like that, I don't know how to answer this question. An indexed table of 100,000 records is faster than an unindexed table of 1,000.
What are your requirements? How much data do you have? Once you know the answer to these questions you can make decisions about indexing and/or partitioning.
This is a very loose question, so a very loose answer :-)
In general if you do the basics - reasonable normalization, a sensible primary key and run-of-the-mill queries - then on today's hardware you'll get away with most things on a small to medium sized database - i.e. one with the largest table having less than 50,000 records.
However, once you get past the 50k - 100k rows, which roughly corresponds to the point when the RDBMS is likely to become memory constrained, then unless you have your access paths set up correctly (i.e. indexes), performance will start to fall off catastrophically. That is in the mathematical sense: in such scenarios it's not unusual to see performance deteriorate by an order of magnitude or two for a doubling in table size.
Obviously therefore the critical table size at which you need to pay attention will vary depending upon row size, machine memory, activity and other environmental issues, so there is no single answer, but it is well to be aware that performance generally does not degrade gracefully with table size and plan accordingly.
I have to disagree with Cruachan about "50k - 100k rows .... roughly correspond(ing) to the point when the rdbms is likely to be memory constrained". This blanket statement is misleading without two additional data points: the approximate size of a row, and the available memory. I'm currently developing a database to find the longest common subsequence (a la bio-informatics) of lines within source code files, and reached millions of rows in one table, even with a VARCHAR field of close to 1000, before it became memory constrained. So, with proper indexing and sufficient RAM (a gig or two), as regards the original question, with rows of 75 bytes at most, there is no reason why the proposed table couldn't hold tens of millions of records.
The proper amount of data is a function of your application, not of the database. There are very few cases where a MySQL problem is solved by breaking a table into multiple subtables, if that's the intent of your question.
If you have a particular situation where queries are slow, it would probably be more useful to discuss how to improve that situation by modifying the query or the table design.

Maximum table size for a MySQL database

What is the maximum size for a MySQL table? Is it 2 million at 50GB? 5 million at 80GB?
At the higher end of the size scale, do I need to think about compressing the data? Or perhaps splitting the table if it grew too big?
I once worked with a very large (Terabyte+) MySQL database. The largest table we had was literally over a billion rows.
It worked. MySQL processed the data correctly most of the time. It was extremely unwieldy though.
Just backing up and storing the data was a challenge. It would take days to restore the table if we needed to.
We had numerous tables in the 10-100 million row range. Any significant joins to the tables were too time consuming and would take forever. So we wrote stored procedures to 'walk' the tables and process joins against ranges of 'id's. In this way we'd process the data 10-100,000 rows at a time (Join against id's 1-100,000 then 100,001-200,000, etc). This was significantly faster than joining against the entire table.
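A sketch of that range-walking technique (procedure, table, and column names are invented; the real code would match your schema):

    -- Processes the join in 100,000-id slices instead of all at once.
    DELIMITER //
    CREATE PROCEDURE walk_join_in_chunks()
    BEGIN
        DECLARE v_start BIGINT DEFAULT 1;
        DECLARE v_max   BIGINT;
        DECLARE v_step  BIGINT DEFAULT 100000;

        SELECT MAX(id) INTO v_max FROM big_table;

        WHILE v_start <= v_max DO
            INSERT INTO join_results (id, other_value)
            SELECT b.id, o.value
            FROM big_table b
            JOIN other_table o ON o.big_id = b.id
            WHERE b.id BETWEEN v_start AND v_start + v_step - 1;

            SET v_start = v_start + v_step;
        END WHILE;
    END //
    DELIMITER ;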
Using indexes on very large tables that aren't based on the primary key is also much more difficult. MySQL (InnoDB) stores indexes in two pieces: secondary indexes (anything other than the primary index) hold the primary key values of the matching rows. So indexed lookups are done in two parts: first MySQL goes to the secondary index and pulls from it the primary key values that it needs to find, then it does a second lookup on the primary key index to find where those rows are.
The net of this is that for very large tables (1-200 Million plus rows) indexing against tables is more restrictive. You need fewer, simpler indexes. And doing even simple select statements that are not directly on an index may never come back. Where clauses must hit indexes or forget about it.
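One way to sidestep that double lookup is a covering index, i.e. an index that contains every column the query touches (the orders table and its columns here are illustrative):

    -- Secondary index that also carries the columns the query returns.
    ALTER TABLE orders
        ADD INDEX idx_customer_status_total (customer_id, status, total);

    -- Every referenced column lives in the index, so InnoDB never visits the
    -- clustered (primary key) index for full rows; EXPLAIN shows "Using index".
    SELECT status, total
    FROM orders
    WHERE customer_id = 987;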
But all that being said, things did actually work. We were able to use MySQL with these very large tables and do calculations and get answers that were correct.
About your first question: the effective maximum size for the database is usually determined by the operating system, specifically the maximum file size MySQL Server will be able to create, not by MySQL Server itself. Those limits play a big role in table size limits. And MyISAM works differently from InnoDB. So any tables will be dependent on those limits.
If you use InnoDB you will have more options for manipulating table sizes; resizing the tablespace is an option in this case, so if you plan to resize it, this is the way to go. Have a look at the "The table is full" error page.
I am not sure of the real maximum record count for each table, given all the necessary information (OS, table type, columns, data type and size of each, etc.), and I am not sure whether this is easy to calculate, but I've seen simple tables with around 1 billion records in a couple of cases and MySQL didn't give up.