Maximum table size for a MySQL database - mysql

What is the maximum size for a MySQL table? Is it 2 million rows at 50GB? 5 million rows at 80GB?
At the higher end of the size scale, do I need to think about compressing the data? Or perhaps splitting the table if it grew too big?

I once worked with a very large (Terabyte+) MySQL database. The largest table we had was literally over a billion rows.
It worked. MySQL processed the data correctly most of the time. It was extremely unwieldy though.
Just backing up and storing the data was a challenge. It would take days to restore the table if we needed to.
We had numerous tables in the 10-100 million row range. Any significant joins against those tables were too time-consuming to run in one pass. So we wrote stored procedures to 'walk' the tables and process joins against ranges of ids. In this way we'd process the data 10,000-100,000 rows at a time (join against ids 1-100,000, then 100,001-200,000, etc.). This was significantly faster than joining against the entire table.
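The range-walking idea can be sketched as follows. This is only an illustration, not the original stored procedures: it uses Python with the standard-library sqlite3 module as a stand-in for MySQL, and the `orders` and `customers` tables, their columns, and the chunk size are all made up for the example.

```python
import sqlite3

def id_ranges(max_id, chunk_size):
    """Yield inclusive (low, high) id ranges: (1, chunk), (chunk + 1, 2 * chunk), ..."""
    low = 1
    while low <= max_id:
        high = min(low + chunk_size - 1, max_id)
        yield low, high
        low = high + 1

def walk_join(conn, chunk_size):
    """Join orders to customers one id range at a time, summing amounts per region."""
    max_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM orders").fetchone()[0]
    totals = {}
    for low, high in id_ranges(max_id, chunk_size):
        rows = conn.execute(
            "SELECT c.region, SUM(o.amount) FROM orders o "
            "JOIN customers c ON c.id = o.customer_id "
            "WHERE o.id BETWEEN ? AND ? GROUP BY c.region",
            (low, high),
        )
        # Merge this chunk's partial sums into the running totals.
        for region, amount in rows:
            totals[region] = totals.get(region, 0) + amount
    return totals

# Hypothetical sample data: 250 orders split evenly between two customers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "north"), (2, "south")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(i, 1 + i % 2, 10) for i in range(1, 251)])
chunked_totals = walk_join(conn, 100)
```

Each chunk touches a bounded slice of the primary key, so every individual query stays small; the trade-off is that aggregates have to be merged in the caller.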
Using indexes on very large tables that aren't based on the primary key is also much more difficult. MySQL (with InnoDB) stores secondary indexes, that is, every index other than the primary index, as indexes to the primary key values. So indexed lookups are done in two parts: first MySQL goes to the secondary index and pulls from it the primary key values that it needs to find, then it does a second lookup on the primary key index to find where those rows are.
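A toy model of that two-step lookup, using plain Python dictionaries in place of real index structures. The table contents and the `email` column are invented for illustration; real indexes are B-trees on disk, not hash maps.

```python
# Primary key index: primary key value -> full row.
primary_index = {
    101: {"id": 101, "email": "ann@example.com", "name": "Ann"},
    102: {"id": 102, "email": "bob@example.com", "name": "Bob"},
    103: {"id": 103, "email": "cat@example.com", "name": "Cat"},
}

# Secondary index on email: indexed value -> primary key values (not rows).
secondary_index_email = {
    "ann@example.com": [101],
    "bob@example.com": [102],
    "cat@example.com": [103],
}

def lookup_by_email(email):
    """Step 1: secondary index gives primary keys. Step 2: primary index gives rows."""
    primary_keys = secondary_index_email.get(email, [])
    return [primary_index[pk] for pk in primary_keys]
```

Every hit in the secondary index costs one more lookup in the primary index, which is why secondary-index queries that match many rows get expensive on very large tables.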
The net of this is that for very large tables (1-200 million plus rows), indexing is more restrictive. You need fewer, simpler indexes. Even simple SELECT statements that do not hit an index directly may never come back. WHERE clauses must hit indexes or forget about it.
But all that being said, things did actually work. We were able to use MySQL with these very large tables and do calculations and get answers that were correct.

About your first question: the effective maximum size for the database is usually determined by the operating system, specifically the maximum file size MySQL Server is able to create, not by MySQL Server itself. Those limits play a big role in table size limits. MyISAM also works differently from InnoDB, so the limit on any given table depends on both the file system and the storage engine.
If you use InnoDB you have more options for managing table sizes. Resizing the tablespace is possible in this case, so if you expect to need that, InnoDB is the way to go. Take a look at the "The table is full" error page in the MySQL manual.
I can't tell you the real maximum number of records per table without all the necessary information (OS, table type, columns, data type and size of each, etc.), and I'm not sure it is easy to calculate even then. But I've seen simple tables with around 1 billion records in a couple of cases, and MySQL didn't give up.

Related

Duplicate table fields vs indexing only

I have a huge and very busy table (a few thousand INSERTs per second). The table stores login logs; it has a bigint ID which is not generated by MySQL but rather by a pseudorandom generator on the MySQL client.
Simply put, the table has loginlog_id, client_id, tons,of,other,columns,with,details,about,session....
I have a few indexes on this table, such as PRIMARY_KEY(loginlog_id) and INDEX(client_id).
In some other part of our system I need to fetch client_id based on loginlog_id. This does not happen that often (just a few hundred SELECT client_id FROM loginlogs WHERE loginlog_id=XXXXXX per second). Table loginlogs is read by various other scripts now and then, and they need various columns each time. But the most frequent read is certainly the above-mentioned lookup of client_id by loginlog_id.
My question is: should I create another table loginlogs_clientids and duplicate loginlog_id, client_id in there (this means another few thousand INSERTs per second, as every loginlogs INSERT needs a matching one)? Or should I be happy with InnoDB handling my lookups by PRIMARY KEY efficiently?
We have tons of RAM (128GB, most of which is used by MySQL). Load of MySQL is between 40% and 350% CPU (we have 12 core CPU). When I tried to use the new table, I did not see any difference. But I am asking for the future, if our usage grows even more, what is the suggested approach? Duplicate or index?
Thanks!
No.
Looking up table data for a single row using the primary key is extremely efficient, and will take the same time for both tables.
Exceptions to that might be very large row sizes (e.g. 8KB+) where client_id is, for example, a varchar that is stored off-page. In that case you might need to read an additional data block, which at least theoretically could cost you some milliseconds.
Even if this strategy would have an advantage, you would not actually do it by creating a new table, but by adding an index (loginlog_id, client_id) to your original table. InnoDB stores everything, including the actual data, in an index structure, so that adding an index is basically the same as adding a new table with the same columns, but without (you) having the problem of synchronizing those two "tables".
Having a structure with a smaller row size can have some advantages for range scans, e.g. MySQL will evaluate select count(*) from tablename using the smallest index of the table, as it has to read fewer bytes. You already have such a small index (on client_id), so even in that regard, adding such an additional table/index shouldn't have an effect. If you have any range scan on the primary key (which is probably unlikely for pseudorandom data), you may want to consider this though, or keep it in mind for cases when you do.

MySQL performance: many rows and columns (MyISAM)

Since I'm still in the beginning of my site design I figured now's a good time to ask this.
I know that one of the ways to optimize MySQL queries is to split your rows into separate tables; however, that does have a few comfort issues.
What I'm considering is this: would querying a table consisting of around 1'000'000 rows and 150 columns, using excellently designed indexes and getting only the needed columns from each query, result in a much higher server load than splitting the table into multiple ones with fewer columns each?
Big blob tables are an anti-pattern, never use them.
Normalized tables will run much much faster than a single blob.
InnoDB is optimized for many small tables that need to be joined.
Using a normalized table will save you many headaches besides:
Your data will be smaller, so more of it fits in memory.
You only store data in one place, so it cannot end up with inconsistent data.
MySQL generally uses only one index per table per select (index merge optimizations aside), so multiple tables mean you get to use more indexes and get more speed.
Triggers on tables execute much faster.
Normalized tables are easier to maintain.
You have fewer indexes per table, so inserts are faster.
Indexes are smaller (fewer rows) and narrower (fewer columns) and will run much faster as a result.
If the data is static, you can pack the tables for greater efficiency (see the MySQL reference manual).

mySQL indexes, when is too many rows too much?

I am currently working on a new system looking at data stored by a CMS about user access logs.
The table I am looking at extracting data from currently has 5 million rows. This is data spanning about 11 months. The SQL queries I am making usually search on something like uid, which is an indexed column.
The question I have, out of interest and for scalability, is: how large does a table need to get before even indexed columns no longer speed up searches?
Indexes will always be faster if the table is mostly read. If you expect writes to scale faster than reads, then updating the index may become more expensive than it's worth.
If uid is your primary key, then it will always be indexed and there's really no overhead for this index since MySQL needs a key for each row anyway.
Proper indexes will always speed up queries...that's the point of them. It doesn't matter how large your table is, the point of the index is to provide the DBMS with an avenue of retrieving a subset of a table faster than if it had to read through the entire table row by row.

MySQL: OPTIMIZE TABLE needed on table with fixed columns?

I have a weekly script that moves data from our live database and puts it into our archive database, then deletes the data it just archived from the live database. Since it's a decent size delete (about 10% of the table gets trimmed), I figured I should be running OPTIMIZE TABLE after this delete.
However, I'm reading this from the mysql documentation and I don't know how to interpret it:
http://dev.mysql.com/doc/refman/5.1/en/optimize-table.html
"OPTIMIZE TABLE should be used if you have deleted a large part of a table or if you have made many changes to a table with variable-length rows (tables that have VARCHAR, VARBINARY, BLOB, or TEXT columns). Deleted rows are maintained in a linked list and subsequent INSERT operations reuse old row positions. You can use OPTIMIZE TABLE to reclaim the unused space and to defragment the data file."
The first sentence is ambiguous to me. Does it mean you should run it if:
A) you have deleted a large part of a table with variable-length rows or if you have made many changes to a table with variable-length rows
OR
B) you have deleted a large part of ANY table or if you have made many changes to a table with variable-length rows
Does that make sense? So if my table has no VAR columns, do I need to run it still?
While we're on the subject - is there any indicator that tells me that a table is ripe for an OPTIMIZE call?
Also, I read this http://www.xaprb.com/blog/2010/02/07/how-often-should-you-use-optimize-table/ that says running OPTIMIZE table only is useful for the primary key. If most of my selects are from other indices, am I just wasting effort on tables that have a surrogate key?
Thanks so much!
In your scenario, I do not believe that regularly optimizing the table will make an appreciable difference.
First things first, your second interpretation (B) of the documentation is correct - "if you have deleted a large part of ANY table OR if you have made many changes to a table with variable-length rows."
If your table has no VAR columns, each record, regardless of the data it contains, takes up the exact same amount of space in the table. If a record is deleted from the table, and the DB chooses to reuse the exact area the previous record was stored, it can do so without wasting any space or fragmenting your data.
As far as whether OPTIMIZE only improves performance on a query that utilizes the primary key index, that answer would almost certainly vary based on what storage engine is in use, and I'm afraid I wouldn't be able to answer that.
However, speaking of storage engines, if you do end up using OPTIMIZE, be aware that InnoDB does not support it natively: the command is mapped to ALTER TABLE, which rebuilds the table and might be a more expensive operation. Either way, the table is locked during the optimization, so be very careful about when you run it.
There are so many differences between MyISAM and InnoDB, I am splitting this answer in two:
MyISAM
FIXED has some meaning for MyISAM.
"Deleted rows are maintained in a linked list and subsequent INSERT operations reuse old row positions" applies to MyISAM, not InnoDB. Hence, for MyISAM tables with a lot of churn, OPTIMIZE can be beneficial.
In MyISAM, VAR plus DELETE/UPDATE leads to fragmentation.
Because of the linked list and VAR, a single row can be fragmented across the data file (.MYD). (Otherwise, a MyISAM row is contiguous in the data file.)
InnoDB
FIXED has no meaning for InnoDB tables.
For VAR in InnoDB, there are "block splits", not a linked list.
In a BTree, block splits stabilize at an average of 69% full. So, with InnoDB, almost any abuse will leave the table not too bloated. That is, DELETE/UPDATE (with or without VAR) leads to the more limited BTree 'fragmentation'.
In InnoDB, emptied blocks (16KB each) are put on a "free list" for reuse; they are not given back to the OS.
Data in InnoDB is ordered by the PRIMARY KEY, so deleting a row in one part of the table does not provide space for a new row in another part of the table. But, when a block is freed up, it can be used elsewhere.
Two adjacent blocks that are half empty will be coalesced, thereby freeing up a block.
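To put the 69% figure in perspective, here is a rough back-of-the-envelope calculation in Python. It assumes the 16KB block size mentioned above and a hypothetical 100-byte row, and it ignores page headers and per-row overhead, so treat the numbers as an approximation only.

```python
import math

PAGE_BYTES = 16 * 1024  # 16KB InnoDB block, as stated above

def pages_needed(row_count, row_bytes, fill_factor):
    """Estimate blocks needed when each block is only fill_factor full."""
    rows_per_page = int(PAGE_BYTES * fill_factor // row_bytes)
    return math.ceil(row_count / rows_per_page)

# Hypothetical table: 1,000,000 rows of 100 bytes each.
packed_pages = pages_needed(1_000_000, 100, 1.0)    # perfectly packed blocks
typical_pages = pages_needed(1_000_000, 100, 0.69)  # blocks averaging 69% full
bloat_ratio = typical_pages / packed_pages
```

Under these assumptions the steady-state table is roughly 1.4 times the size of a perfectly packed one, which is the upper bound on what a rebuild could win back from block-level fragmentation alone.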
Both
If you are removing "old" data (your 10%), then PARTITIONing is a much better way to do it. See my blog. It involves DROP PARTITION, which is instantaneous and gives space back to the OS, plus REORGANIZE PARTITION, which can be instantaneous.
OPTIMIZE TABLE is almost never worth doing.
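As a rough sketch of the partition-based purge described above, the Python helpers below only build the MySQL statements as strings; they do not run them. The table name `live_events`, the partition names, and the boundary expression are hypothetical, and the table is assumed to be RANGE-partitioned already, with a catch-all partition defined as VALUES LESS THAN MAXVALUE.

```python
def drop_partition_sql(table, partition):
    """Statement that discards one whole partition (the 'old' data)."""
    return f"ALTER TABLE {table} DROP PARTITION {partition};"

def split_catch_all_sql(table, catch_all, new_partition, upper_bound):
    """Statement that carves a new partition out of the MAXVALUE catch-all."""
    return (
        f"ALTER TABLE {table} REORGANIZE PARTITION {catch_all} INTO ("
        f"PARTITION {new_partition} VALUES LESS THAN ({upper_bound}), "
        f"PARTITION {catch_all} VALUES LESS THAN MAXVALUE);"
    )

# Hypothetical weekly rotation: drop the oldest week, prepare room for the next.
drop_stmt = drop_partition_sql("live_events", "p2024w01")
split_stmt = split_catch_all_sql(
    "live_events", "p_future", "p2024w12", "TO_DAYS('2024-03-25')"
)
```

The REORGANIZE step stays cheap only while the catch-all partition is still empty, so it is usually run ahead of time, before rows for the new range arrive.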

What is the optimal amount of data for a table?

How much data should be in a table so that reading is optimal? Assuming that I have 3 fields varchar(25). This is in MySQL.
I would suggest that you consider the following in optimizing your database design:
Consider what you want to accomplish with the database. Will you be performing a lot of inserts to a single table at very high rates? Or will you be performing reporting and analytical functions with the data?
Once you've determined the purpose of the database, define what data you need to store to perform whatever functions are necessary.
Normalize till it hurts. If you're performing transaction processing (the most common function for a database) then you'll want a highly normalized database structure. If you're performing analytical functions, then you'll want a more denormalized structure that doesn't have to rely on joins to generate report results.
Typically, if you've really normalized the structure till it hurts then you need to take your normalization back a step or two to have a data structure that will be both normalized and functional.
A normalized database is mostly pointless if you fail to use keys. Make certain that each table has a primary key defined. Don't use surrogate keys just because it's what you always see. Consider what natural keys might exist in any given table. Once you are certain that you have the right primary key for each table, then you need to define your foreign key references. Establishing explicit foreign key relationships rather than relying on implicit definition will give you a performance boost, provide integrity for your data, and self-document the database structure.
Look for other indexes that exist within your tables. Do you have a column or set of columns that you will search against frequently like a username and password field? Indexes can be on a single column or multiple columns so think about how you'll be querying for data and create indexes as necessary for values you'll query against.
The number of rows should not matter. Make sure the fields you're searching on are indexed properly. If you only have 3 varchar(25) fields, then you probably need to add a primary key that is not a varchar.
Agree that you should ensure that your data is properly indexed.
Apart from that, if you are worried about table size, you can always implement some type of data archival strategy later down the line.
Don't worry too much about this until you see problems cropping up, and don't optimise prematurely.
For optimal reading you should have an index. A table exists to hold the rows it was designed to contain. As the number of rows increases, the value of the index comes into play and reading remains brisk.
Phrased as such, I don't know how to answer this question. An indexed table of 100,000 records is faster than an unindexed table of 1,000.
What are your requirements? How much data do you have? Once you know the answer to these questions you can make decisions about indexing and/or partitioning.
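The claim about 100,000 indexed records versus 1,000 unindexed ones can be made concrete by counting comparisons. The sketch below is a simplification: it uses binary search over a sorted Python list as a stand-in for an index (MySQL indexes are B-trees, not sorted arrays) and a linear scan as a stand-in for an unindexed table.

```python
def binary_search_probes(sorted_keys, target):
    """Count the keys examined by a binary search for target."""
    low, high, probes = 0, len(sorted_keys) - 1, 0
    while low <= high:
        mid = (low + high) // 2
        probes += 1
        if sorted_keys[mid] == target:
            return probes
        if sorted_keys[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return probes

def scan_probes(rows, target):
    """Count the rows examined by a front-to-back scan for target."""
    probes = 0
    for value in rows:
        probes += 1
        if value == target:
            break
    return probes

indexed_keys = list(range(100_000))   # "indexed" table: 100,000 sorted keys
unindexed_rows = list(range(1_000))   # "unindexed" table: 1,000 rows

indexed_cost = binary_search_probes(indexed_keys, 99_999)
unindexed_cost = scan_probes(unindexed_rows, 999)
```

Finding a key among 100,000 sorted entries takes at most 17 comparisons, while finding the last row of the 1,000-row unindexed table takes 1,000.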
This is a very loose question, so a very loose answer :-)
In general if you do the basics - reasonable normalization, a sensible primary key and run-of-the-mill queries - then on today's hardware you'll get away with most things on a small to medium sized database - i.e. one with the largest table having less than 50,000 records.
However, once you get past 50k - 100k rows, which roughly corresponds to the point when the rdbms is likely to be memory constrained, then unless you have your access paths set up correctly (i.e. indexes), performance will start to fall off catastrophically. That is in the mathematical sense: in such scenarios it's not unusual to see performance deteriorate by an order of magnitude or two for a doubling in table size.
Obviously therefore the critical table size at which you need to pay attention will vary depending upon row size, machine memory, activity and other environmental issues, so there is no single answer, but it is well to be aware that performance generally does not degrade gracefully with table size and plan accordingly.
I have to disagree with Cruachan about "50k - 100k rows .... roughly correspond(ing) to the point when the rdbms is likely to be memory constrained". This blanket statement is misleading without two additional pieces of data: the approximate size of a row, and the available memory. I'm currently developing a database to find the longest common subsequence (a la bio-informatics) of lines within source code files, and reached millions of rows in one table, even with a VARCHAR field of close to 1000, before it became memory constrained. So, with proper indexing and sufficient RAM (a gig or two), as regards the original question, with rows of 75 bytes at most, there is no reason why the proposed table couldn't hold tens of millions of records.
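The arithmetic behind that estimate is simple enough to write down. This Python sketch assumes the 75-byte maximum row (3 fields of varchar(25), one byte per character) and ignores index size and per-row storage overhead, so the real footprint would be somewhat larger.

```python
GIB = 1024 ** 3  # bytes in one gibibyte

def table_bytes(row_count, row_bytes):
    """Raw data size: rows times bytes per row, with no storage overhead."""
    return row_count * row_bytes

def fits_in_ram(row_count, row_bytes, ram_bytes):
    """True if the raw row data alone would fit in the given memory."""
    return table_bytes(row_count, row_bytes) <= ram_bytes

ten_million = table_bytes(10_000_000, 75)     # 750,000,000 bytes
twenty_million = table_bytes(20_000_000, 75)  # 1,500,000,000 bytes
```

Ten million such rows come to about 750 MB of raw data, which fits in 1 GiB; twenty million need about 1.5 GB, which fits in 2 GiB. The 50k - 100k row threshold therefore says little without the row size and memory figures.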
The proper amount of data is a function of your application, not of the database. There are very few cases where a MySQL problem is solved by breaking a table into multiple subtables, if that's the intent of your question.
If you have a particular situation where queries are slow, it would probably be more useful to discuss how to improve that situation by modifying query or the table design.