I have a weekly script that moves data from our live database and puts it into our archive database, then deletes the data it just archived from the live database. Since it's a decent size delete (about 10% of the table gets trimmed), I figured I should be running OPTIMIZE TABLE after this delete.
However, I'm reading this from the MySQL documentation and I don't know how to interpret it:
http://dev.mysql.com/doc/refman/5.1/en/optimize-table.html
"OPTIMIZE TABLE should be used if you have deleted a large part of a table or if you have made many changes to a table with variable-length rows (tables that have VARCHAR, VARBINARY, BLOB, or TEXT columns). Deleted rows are maintained in a linked list and subsequent INSERT operations reuse old row positions. You can use OPTIMIZE TABLE to reclaim the unused space and to defragment the data file."
The first sentence is ambiguous to me. Does it mean you should run it if:
A) you have deleted a large part of a table with variable-length rows or if you have made many changes to a table with variable-length rows
OR
B) you have deleted a large part of ANY table or if you have made many changes to a table with variable-length rows
Does that make sense? So if my table has no VAR columns, do I need to run it still?
While we're on the subject - is there any indicator that tells me that a table is ripe for an OPTIMIZE call?
Also, I read this http://www.xaprb.com/blog/2010/02/07/how-often-should-you-use-optimize-table/ that says running OPTIMIZE TABLE is only useful for the primary key. If most of my selects are from other indices, am I just wasting effort on tables that have a surrogate key?
Thanks so much!
In your scenario, I do not believe that regularly optimizing the table will make an appreciable difference.
First things first, your second interpretation (B) of the documentation is correct - "if you have deleted a large part of ANY table OR if you have made many changes to a table with variable-length rows."
If your table has no VAR columns, each record, regardless of the data it contains, takes up exactly the same amount of space in the table. If a record is deleted from the table, and the DB chooses to reuse the exact area where the previous record was stored, it can do so without wasting any space or fragmenting your data.
As far as whether OPTIMIZE only improves performance on a query that utilizes the primary key index, that answer would almost certainly vary based on what storage engine is in use, and I'm afraid I wouldn't be able to answer that.
However, speaking of storage engines, if you do end up using OPTIMIZE, be aware that InnoDB does not support it directly, so the command maps to ALTER TABLE and rebuilds the table, which might be a more expensive operation. Either way, the table locks during the optimization, so be very careful about when you run it.
There are so many differences between MyISAM and InnoDB, I am splitting this answer in two:
MyISAM
FIXED has some meaning for MyISAM.
"Deleted rows are maintained in a linked list and subsequent INSERT operations reuse old row positions" applies to MyISAM, not InnoDB. Hence, for MyISAM tables with a lot of churn, OPTIMIZE can be beneficial.
In MyISAM, VAR plus DELETE/UPDATE leads to fragmentation.
Because of the linked list and VAR, a single row can be fragmented across the data file (.MYD). (Otherwise, a MyISAM row is contiguous in the data file.)
InnoDB
FIXED has no meaning for InnoDB tables.
For VAR in InnoDB, there are "block splits", not a linked list.
In a BTree, block splits stabilize at and average 69% full. So, with InnoDB, almost any abuse will leave the table not too bloated. That is, DELETE/UPDATE (with or without VAR) leads to the more limited BTree 'fragmentation'.
In InnoDB, emptied blocks (16KB each) are put on a "free list" for reuse; they are not given back to the OS.
Data in InnoDB is ordered by the PRIMARY KEY, so deleting a row in one part of the table does not provide space for a new row in another part of the table. But, when a block is freed up, it can be used elsewhere.
Two adjacent blocks that are half empty will be coalesced, thereby freeing up a block.
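The 69% figure can be reproduced approximately with a small simulation. The sketch below is a simplified model, not InnoDB's actual code: it tracks only leaf blocks, splits a full block in half, and ignores deletes and block merging. The function name, the block capacity of 64 rows, and the insert count are all assumptions chosen for illustration.

```python
import bisect
import random


def simulate_leaf_fill(num_inserts=50_000, capacity=64, seed=1):
    """Insert random keys into toy B-tree leaves; return the average fill ratio."""
    rng = random.Random(seed)
    leaves = [[]]            # each leaf is a sorted list of keys
    lows = [float("-inf")]   # lows[i] is the lower bound of leaves[i]
    for _ in range(num_inserts):
        key = rng.random()
        # Find the leaf whose key range contains the new key.
        i = bisect.bisect_right(lows, key) - 1
        leaf = leaves[i]
        bisect.insort(leaf, key)
        if len(leaf) > capacity:
            # Block split: move the upper half into a new leaf to the right.
            mid = len(leaf) // 2
            right = leaf[mid:]
            del leaf[mid:]
            leaves.insert(i + 1, right)
            lows.insert(i + 1, right[0])
    used = sum(len(leaf) for leaf in leaves)
    return used / (len(leaves) * capacity)


print(f"average leaf fill: {simulate_leaf_fill():.0%}")
```

With random insert order the ratio settles somewhere around two thirds, which is in line with the 69% quoted above; the exact value depends on the seed and the parameters.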
Both
If you are removing "old" data (your 10%), then PARTITIONing is a much better way to do it. See my blog. It involves DROP PARTITION, which is instantaneous and gives space back to the OS, plus REORGANIZE PARTITION, which can be instantaneous.
OPTIMIZE TABLE is almost never worth doing.
Related
I have a table such as follows:
CREATE TABLE Associations (
  obj_id int unsigned NOT NULL,
  attr_id int unsigned NOT NULL,
  assignment double NOT NULL,
  PRIMARY KEY (`obj_id`, `attr_id`)
);
Now the insertion order for the rows is/will be random. Would such a definition lead to fragmentation of the table? Should I be adding an auto inc primary key or would that only speed up the insert and would not help the speed of SELECT queries?
What would a better table definition be for random inserts?
Note that, performance-wise, I am more interested in SELECT than INSERT.
(Assuming you are using ENGINE=InnoDB.)
Short answer: Do not fret about fragmentation.
Long answer:
There are two types of "fragmentation" -- Which one bothers you?
1) BTree blocks becoming less than full.
2) Blocks becoming scattered around the disk.
If you have an SSD disk, the scattering of blocks around the disk has no impact on performance. For HDD, it matters some, but still not enough to get very worried about.
Fragmentation does not "run away". If two adjacent blocks are seen to be relatively empty, they are combined. Result: The "average" block is about 69% full.
In your particular example, when you want multiple "attributes" for one "object", they will be found "clustered". That is they will be mostly in the same block, hence a bit faster to access. Adding id AUTO_INCREMENT PRIMARY KEY would slow SELECTs/UPDATEs down.
Another reason why an id would slow down SELECTs is that SELECT * FROM t WHERE obj_id=... needs to first find the item in the index, then reach into the data for the other columns. With PRIMARY KEY(obj_id, ...), there is no need for this extra hop. (In some situations, this is a big speedup.)
OPTIMIZE TABLE takes time and blocks access while you are running it.
Even after OPTIMIZE, fragmentation comes back -- for a variety of reasons.
"Fill factor" is virtually useless -- UPDATE and DELETE store extra copies of rows pending COMMIT. This leads to block splits (aka page splits) if fill_factor is too high or sparse blocks if too low. That is, it is too hard to be worth trying to tune.
Fewer indexes means less disk space, etc. You probably need an index on (obj_id, attr_id) whether or not you also have (id). So, why waste space when it does not help?
The one case where OPTIMIZE TABLE can make a noticeable difference is after you delete lots of rows. I discuss several ways to avoid this issue here: http://mysql.rjweb.org/doc.php/deletebig
I guess you use the InnoDB access method. InnoDB stores its data in a so-called clustered index. That is, all the data is stashed away behind the BTREE primary key.
Read this for background.
When you insert a row, you're inserting it into the BTREE structure. To oversimplify, BTREEs are made up of elaborately linked pages accessible in order. That means your data goes into some page somewhere. When you insert data in primary-key order, the data goes into a page at the end of the BTREE. So, when a page fills up, InnoDB just makes another one and puts your data there.
But, when you insert in some other order, often your row must go between other rows in an existing BTREE page. If the page has enough free space, InnoDB can drop your data into it. But, if the page does not have enough space, InnoDB must do a page split. It makes two pages from one, and puts your new row into one of the two.
Doing inserts in some order other than index order causes more page splits. That's why it doesn't perform as well. The classic example is building a table with a UUIDv4 (random) primary key column.
Now, you asked about autoincrementing primary keys. If you have such a key in your InnoDB table, all (or almost all) your INSERTs go into the last page of the clustered index, so you don't get the page split overhead. Cool.
But, if you need an index on some other column or columns that aren't in your INSERT order, you'll get page splits in that secondary index. The entries in secondary indexes are often smaller than the ones in clustered indexes, so you get fewer page splits. But you still get them.
Some DBMSs, but not MySQL, let you declare FILL_PERCENT(50) or something in both clustered and secondary indexes. That's useful for out-of-order loads because you can make your pages start out with less space already used, so you get fewer page splits. (Of course, you use more RAM and SSD with lower fill factors.)
MySQL doesn't have FILL_FACTOR in its data definition language. It does have a global systemwide variable called innodb_fill_factor. It is a percentage number. Its default is 100, which actually means 1/16th of each page is left unused.
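If you know you have to do a big out-of-index-order bulk load, you can give this command first to leave 60% of each new page available, to reduce page splits.
SET GLOBAL innodb_fill_factor = 40;
But beware, this is a system-wide setting. It will apply to everything on your MySQL server. You might want to put it back when done to save RAM and SSD space in production. Also note that the MySQL manual describes this variable as applying to sorted index builds (when an index is created or rebuilt), so it shapes pages during a rebuild rather than during ordinary row-by-row INSERTs.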
Finally, OPTIMIZE TABLE tablename; can reorganize tables that have had a lot of page splits to clean them up. (In InnoDB the OPTIMIZE command actually maps to ALTER TABLE tablename FORCE; ANALYZE TABLE tablename;.) It can take a while, so beware.
When you OPTIMIZE, InnoDB remakes the pages to bring their fill percentages near to the number you set in the system variable.
Unless you're doing a really vast bulk load on a vast table, my advice is to not worry about all this fill percentage business. Design your table to match your application and don't look back.
When you're done with any bulk load you can, if you want, do OPTIMIZE TABLE to get rid of any gnarly page splits.
Edit Your choice of primary key is perfect for your queries' WHERE pattern obj_id IN (val, val, val). Don't change that primary key, especially not to an autoincrementing one.
Pro tip It's tempting to try to foresee scaling problems in the early days of an app's lifetime. And there's no harm in it. But in the case of SQL databases, it's really hard to foresee the actual query patterns that will emerge as your app scales up. Fortunately, SQL's designed so you can add and tweak indexes as you go. You don't have to achieve performance perfection on day 1. So, my advice: think about this issue, but avoid overthinking it. With respect, you're starting to overthink it.
Can someone point me in the right direction? I can't find any documentation on this behavior.
We know when you delete rows from a table you end up with "holes" which you can defrag with OPTIMIZE. Do new inserts automatically fill in those holes if left alone? Is there a way to force that behavior if not? Using InnoDB tables for revolving logs, deleting old rows and adding new, would the table roll over or continuously consume disk space? Or would a different engine be better suited for this?
Yes, I know of table partitions; I want to explore all options first.
Since this is mostly a non-issue, I will assume you are asking for academic reasons?
InnoDB (you should be using that Engine!) stores the data (and each secondary index) in separate B+Trees.
The data's BTree is ordered by the PRIMARY KEY. The various leaf nodes will be filled to different degrees, based on the order of inserts, deletes, updates (that change the row length), temporary transactional locks on rows, etc, etc.
That last one is because one transaction sees effectively an instantaneous snapshot of the data, possibly different than another transaction's view. This implies that multiple copies of a row may coexist.
The buffer_pool holds 16KB blocks. Each block holds a variable number of rows. This number changes with the changing tides. If two adjacent blocks become "too empty", they will be combined.
Totally empty blocks (say, due to lots of deletes) will be put on a free chain for later reuse by Inserts. But note that the disk used by the table will not shrink.
The failure to shrink is usually not a problem -- most tables grow; any shrinkage is soon followed by a new growth spurt.
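For a revolving log, the practical consequence is that the file stops growing once the free chain can satisfy new inserts. The toy model below sketches that behavior; it is not how InnoDB is implemented, and the class name ToyTablespace and its methods are hypothetical.

```python
class ToyTablespace:
    """Toy model: freed blocks go on a free list; the file never shrinks."""

    def __init__(self):
        self.file_pages = 0   # blocks ever allocated in the data file
        self.free_list = []   # emptied blocks waiting for reuse

    def allocate(self):
        # Reuse an emptied block if one exists; otherwise extend the file.
        if self.free_list:
            return self.free_list.pop()
        page = self.file_pages
        self.file_pages += 1
        return page

    def free(self, page):
        # The block becomes reusable, but file_pages is not reduced.
        self.free_list.append(page)


ts = ToyTablespace()
live = [ts.allocate() for _ in range(10)]   # table grows to 10 blocks
for page in live[:4]:                       # a big delete empties 4 blocks
    ts.free(page)
reused = [ts.allocate() for _ in range(3)]  # new inserts reuse freed blocks
print(ts.file_pages, len(ts.free_list))     # prints "10 1"
```

The file stays at 10 blocks throughout: the delete does not shrink it, and the later inserts do not grow it.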
PARTITIONs are usually not worth using. However, that is the best way to "keep data for only 90 days", then use DROP PARTITION instead of a big, slow DELETE. (That is about the only use for PARTITION.)
If you add up all the bytes in the INTs (4 bytes each), VARCHARs (pick the average length), etc, etc, you will get what seems like a good estimate for the disk space being used. But due to the things discussed above, you need to multiply that number by 2 to 3 to get a better estimate of the disk space actually consumed by the table.
I have a huge and very busy table (a few thousand INSERTs per second). The table stores loginlogs; it has a bigint ID which is not generated by MySQL but rather by a pseudorandom generator on the MySQL client.
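As a worked example of that rule of thumb, here is a small sketch. The function name and the sample table (two INTs plus a VARCHAR averaging 20 bytes, one million rows) are made up for illustration; only the 2x to 3x multiplier comes from the text above.

```python
def estimate_table_bytes(column_bytes, row_count, overhead_factors=(2, 3)):
    """Return (raw, low, high) byte estimates for a table.

    raw is the naive sum of column sizes times rows; low and high apply
    the 2x to 3x multiplier suggested above for real on-disk usage.
    """
    raw = sum(column_bytes) * row_count
    low_factor, high_factor = overhead_factors
    return raw, raw * low_factor, raw * high_factor


# Hypothetical table: two INT columns (4 bytes each) and one VARCHAR
# averaging 20 bytes, with one million rows.
raw, low, high = estimate_table_bytes([4, 4, 20], 1_000_000)
print(raw, low, high)  # prints "28000000 56000000 84000000"
```

So a table that looks like 28 MB on paper would be expected to occupy roughly 56 to 84 MB on disk.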
Simply put, the table has loginlog_id, client_id, tons,of,other,columns,with,details,about,session....
I have a few indexes on this table, such as PRIMARY KEY(loginlog_id) and INDEX(client_id).
In some other part of our system I need to fetch client_id based on loginlog_id. This does not happen that often (just a few hundred SELECT client_id FROM loginlogs WHERE loginlog_id=XXXXXX per second). Table loginlogs is read by various other scripts now and then, and always various columns are needed. But the most frequent read is, for sure, the above-mentioned lookup of client_id by loginlog_id.
My question is: should I create another table loginlogs_clientids and duplicate loginlog_id, client_id in there (this means another few thousand INSERTs, as for every loginlogs INSERT I get this new one)? Or should I be happy with InnoDB handling my lookups by PRIMARY KEY efficiently?
We have tons of RAM (128GB, most of which is used by MySQL). Load of MySQL is between 40% and 350% CPU (we have a 12-core CPU). When I tried to use the new table, I did not see any difference. But I am asking for the future: if our usage grows even more, what is the suggested approach? Duplicate or index?
Thanks!
No.
Looking up table data for a single row using the primary key is extremely efficient, and will take the same time for both tables.
Exceptions to that might be very large row sizes (e.g. 8KB+) where client_id is, e.g., a varchar that is stored off-page, in which case you might need to read an additional data block, which at least theoretically could cost you some milliseconds.
Even if this strategy would have an advantage, you would not actually do it by creating a new table, but by adding an index (loginlog_id, client_id) to your original table. InnoDB stores everything, including the actual data, in an index structure, so that adding an index is basically the same as adding a new table with the same columns, but without (you) having the problem of synchronizing those two "tables".
Having a structure with a smaller row size can have some advantages for ranged scans, e.g. MySQL will evaluate select count(*) from tablename using the smallest index of the table, as it has to read fewer bytes. You already have such a small index (on client_id), so even in that regard, adding such an additional table/index shouldn't have an effect. If you have any range scan on the primary key (which is probably unlikely for pseudorandom data), you may want to consider this though, or keep it in mind for cases when you have.
I have a database with a single table that keeps track of user state. When I'm done handling the row, it's no longer necessary to keep it in the database and it can be deleted.
Now let's say I wanted to keep track of the row instead of deleting it (for historical purposes, analytics, etc). Would it be better to:
1) Leave the data in the same table and mark the row as 'used' (with an extra column or something like that)
2) Delete the row from the table and insert it into a separate table that is created only for historical purposes
For choice #1, I wonder if leaving the unnecessary rows in the database will start to affect query performance. (All of my queries are on indexed columns, so maybe this doesn't matter?)
For choice #2, I wonder if the constant deleting of rows will end up causing problems such as fragmentation?
Query performance will be better in the long run if you delete the rows:
What is happening with forever inserts:
The table grows, indexes grow, and index (lookup) performance decreases with the size of the table; insert performance is hurt especially.
What is happening with delete:
Table pages get fragmented, so the deleted space is not re-used 100% as expected; it is closer to 50% in MySQL. So the table still grows to about twice the size you might expect for your amount of data. The index gets fragmented and becomes lopsided: it contains your new data but also the structure for your old data. How bad this gets depends on the structure of your data. This situation, however, stabilizes at a certain performance. This performance point has 2 benefits:
1) The table is more limited in size, so potential full table scans are faster
2) Your performance is predictable.
Due to the fragmentation, however, this performance point does not correspond to exactly twice your data amount; it tends to be a bit worse (benchmark it to see for yourself). The benefit of the delete scenario, however, is that since you have a smaller data set, you might be able to rebuild your index once every reasonable period, thus improving your performance.
Alternatives
There are two alternatives you can look at to improve performance:
Switch to MariaDB: This gains about 8% performance on large datasets (my observation, dataset just about 200GB compressed data)
Look at partitioning: If you have a handy partitioning parameter, you can create a series of "small tables" and avoid the logic for delete, rebuild and historic data management. This might give you the best performance profile.
If most of the table is flagged as deleted, you will be stumbling over them as you look for the non-deleted records. Adding is_deleted to many of the indexes is likely to help.
If you are deleting records purely on age, then PARTITION BY RANGE(TO_DAYS(...)) is an excellent way to build the table. The DROP PARTITION is instantaneous and the ALTER TABLE ... REORGANIZE ... to create a new week (or month or ...) partition is also instantaneous. See my blog for details.
If you "move" records to another table, then the table will not shrink very fast due to fragmentation. If you have enough disk space, this is not a big deal. If some queries need to see both current and archived records, use UNION ALL; it is pretty easy and efficient.
What is the maximum size for a MySQL table? Is it 2 million at 50GB? 5 million at 80GB?
At the higher end of the size scale, do I need to think about compressing the data? Or perhaps splitting the table if it grew too big?
I once worked with a very large (Terabyte+) MySQL database. The largest table we had was literally over a billion rows.
It worked. MySQL processed the data correctly most of the time. It was extremely unwieldy though.
Just backing up and storing the data was a challenge. It would take days to restore the table if we needed to.
We had numerous tables in the 10-100 million row range. Any significant joins to the tables were too time consuming and would take forever. So we wrote stored procedures to 'walk' the tables and process joins against ranges of 'id's. In this way we'd process the data 10,000-100,000 rows at a time (join against ids 1-100,000, then 100,001-200,000, etc). This was significantly faster than joining against the entire table.
Using indexes on very large tables that aren't based on the primary key is also much more difficult. MySQL stores indexes in two pieces: it stores indexes (other than the primary index) as indexes to the primary key values. So indexed lookups are done in two parts: first MySQL goes to an index and pulls from it the primary key values that it needs to find, then it does a second lookup on the primary key index to find where those values are.
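The range walking itself is simple to sketch. The snippet below is an illustration in Python rather than the stored procedures we actually used; the function name id_ranges and the query text in the print statement are hypothetical.

```python
def id_ranges(min_id, max_id, chunk_size):
    """Yield inclusive (low, high) id ranges covering min_id..max_id."""
    start = min_id
    while start <= max_id:
        end = min(start + chunk_size - 1, max_id)
        yield start, end
        start = end + 1


# Walk ids 1..250,000 in chunks of 100,000. The query text is a
# placeholder for whatever join needs to be processed per range.
for low, high in id_ranges(1, 250_000, 100_000):
    print(f"process join WHERE id BETWEEN {low} AND {high}")
```

Each iteration touches a bounded slice of the table, so every individual join stays small and the last chunk is simply shorter than the others.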
The net of this is that for very large tables (1-200 Million plus rows) indexing against tables is more restrictive. You need fewer, simpler indexes. And doing even simple select statements that are not directly on an index may never come back. Where clauses must hit indexes or forget about it.
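To make the two-part lookup concrete, here is a toy model in Python. It is not MySQL's storage code: the dictionaries stand in for B-trees, and the column names (id, email, name) and helper functions are hypothetical.

```python
# The primary (clustered) index maps primary key -> full row.
clustered = {
    1: {"id": 1, "email": "ann@example.com", "name": "Ann"},
    2: {"id": 2, "email": "bob@example.com", "name": "Bob"},
    3: {"id": 3, "email": "cat@example.com", "name": "Cat"},
}

# A secondary index maps the indexed value -> primary key values.
secondary_email = {
    "ann@example.com": [1],
    "bob@example.com": [2],
    "cat@example.com": [3],
}


def lookup_by_pk(pk):
    """One index probe: the row lives in the primary index itself."""
    return clustered[pk], 1


def lookup_by_email(email):
    """Probe the secondary index, then the primary index per match."""
    probes = 1  # the secondary index probe
    rows = []
    for pk in secondary_email.get(email, []):
        rows.append(clustered[pk])
        probes += 1  # second hop into the primary index
    return rows, probes


print(lookup_by_pk(2)[1])                     # prints "1"
print(lookup_by_email("bob@example.com")[1])  # prints "2"
```

Every row found through the secondary index costs one extra probe, which is why queries that match many rows through a non-primary index get expensive on very large tables.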
But all that being said, things did actually work. We were able to use MySQL with these very large tables and do calculations and get answers that were correct.
About your first question, the effective maximum size for the database is usually determined by the operating system, specifically the file size MySQL Server will be able to create, not by MySQL Server itself. Those limits play a big role in table size limits. And MyISAM works differently from InnoDB. So any table will be dependent on those limits.
If you use InnoDB you will have more options for manipulating table sizes. Resizing the tablespace is an option in this case, so if you plan to resize it, this is the way to go. Take a look at the "The table is full" error page.
I am not sure what the real record limit of each table is, even given all the necessary information (OS, table type, columns, data type and size of each, etc.), and I am not sure this is easy to calculate. But I've seen simple tables with around a billion records in a couple of cases, and MySQL didn't give up.