Post-optimization needed after deleting rows in a MySQL database - mysql

I have a log table that is currently 10GB. It holds data for the past 2 years, and at this point I don't think I need so much of it. Am I wrong to assume that keeping years of data in one table is a bad idea (that is, a smaller table is better)?
All my tables use the MyISAM engine.
I would like to delete all data from 2014 and 2015, and soon I'll do 2016, but I'm concerned about what exactly will happen after I run the DELETE statement. As I understand it, because the table is MyISAM, a lock will be taken during which no writes can take place? I would probably delete the data month by month, late at night, to minimize the impact, since this is a production DB.
My main question is this: should I take some sort of action after the deletion? Do I need to manually tell MySQL to do anything to my table, or will MySQL do all the housekeeping itself (reclaiming space, reindexing, and ultimately optimizing my table) after the 400,000k records I'll be deleting?
Thanks everyone!

Plan A: Use time-series PARTITIONing of the table so that future deletions are 'instantaneous' via DROP PARTITION. More discussion here. Partitioning only helps if you will be deleting all rows older than X.
Plan B: To avoid lengthy locking, chunk the deletes. See here. This is optionally followed by an OPTIMIZE TABLE to reclaim space.
Plan C: Simply copy over what you want to keep, then abandon the rest. This is especially good if you need to preserve only a small proportion of the table.
CREATE TABLE new LIKE real;
INSERT INTO new
SELECT * FROM real
WHERE ... ; -- just the newer rows;
RENAME TABLE real TO old, new TO real; -- instantaneous and atomic
DROP TABLE old; -- after verifying that all went well.
Note: The .MYD file contains the data; it will never shrink on its own. Deletes leave holes in it. Further inserts (and updates) will use the holes in preference to growing the table. Plans A and C (but not B) avoid the holes and truly free up space.
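A minimal sketch of Plan B, assuming a hypothetical table log_table with a DATETIME column created_at (neither name comes from the question). The chunk size of 10000 is an arbitrary starting point, not a recommendation.

```sql
-- Hypothetical names: log_table, created_at.
-- Run this statement repeatedly (for example from a script or cron job)
-- until it reports 0 rows affected. Each run holds the MyISAM table lock
-- only for the duration of one chunk.
DELETE FROM log_table
WHERE created_at < '2016-01-01'
LIMIT 10000;

-- Optional, once all chunks are done: rebuild the table to reclaim the holes.
-- The table is locked while the rebuild runs.
OPTIMIZE TABLE log_table;
```

Pausing briefly between chunks gives queued writers a chance to get through.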

Tim and e4c5 have given some good recommendations and I urge them to add their answers.
You can run OPTIMIZE TABLE after doing the deletes. Optimize table will help you with a few things (taken from the docs):
If the table has deleted or split rows, repair the table.
If the index pages are not sorted, sort them.
If the table's statistics are not up to date (and the repair could not be accomplished by sorting the index), update them.
According to the docs: http://dev.mysql.com/doc/refman/5.7/en/optimize-table.html
Use OPTIMIZE TABLE in these cases, depending on the type of table:
...
After deleting a large part of a MyISAM or ARCHIVE table, or making
many changes to a MyISAM or ARCHIVE table with variable-length rows
(tables that have VARCHAR, VARBINARY, BLOB, or TEXT columns). Deleted
rows are maintained in a linked list and subsequent INSERT operations
reuse old row positions. You can use OPTIMIZE TABLE to reclaim the
unused space and to defragment the data file. After extensive changes
to a table, this statement may also improve performance of statements
that use the table, sometimes significantly.
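To judge whether an OPTIMIZE TABLE is worth running, one option is to look at how much space is allocated but unused. The query below is a sketch; the schema name mydb and table name log_table are placeholders, not names from the question.

```sql
-- Placeholder names: mydb, log_table.
-- data_free reports bytes that are allocated to the table but currently unused,
-- which for MyISAM corresponds to the holes left behind by deleted rows.
SELECT table_name, data_length, index_length, data_free
FROM information_schema.TABLES
WHERE table_schema = 'mydb'
  AND table_name = 'log_table';
```

Comparing data_free before and after the OPTIMIZE TABLE shows how much was actually reclaimed.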

Related

Analyze + Optimize on InnoDB Tables

Back when I was working heavily with MyISAM tables, I always had a cron job which ran
~# mysqlanalyze -o database
I know that MyISAM benefits from this in certain ways, e.g. reduced fragmentation.
Now, when running the same command on a database where the majority of tables are InnoDB, I wonder whether this does any good to the tables and is considered good practice to run every now and then, or whether it is rather counterproductive. I see a lot of:
Table does not support optimize, doing recreate + analyze instead
Which sounds expensive with regard to disk I/O and CPU time?
I would appreciate some input on this.
https://dev.mysql.com/doc/refman/8.0/en/optimize-table.html says:
For InnoDB tables, OPTIMIZE TABLE is mapped to ALTER TABLE ... FORCE, which rebuilds the table to update index statistics and free unused space in the clustered index.
This does do some good in cases when you had too much fragmentation. Pages will be filled more efficiently, indexes will be rebuilt, and disk space occupied by the table will be reduced if you use innodb_file_per_table (which is the default in recent versions).
It does take time, depending on the size of your table. It will lock the table while it's running. It will require extra disk space while it's running, as it creates a copy of the table.
Doing optimize table on an InnoDB table is usually not necessary to do frequently, but only after you do a lot of insert/update/delete against the table in a way that could result in fragmentation.
ANALYZE TABLE has much less impact on InnoDB. It doesn't require building a copy of the table. It's a read-only action: it just reads a random sample of pages from the table and uses that to estimate the number of rows and the average row size, and it updates statistics about the indexes to guide the query optimizer. This is safe to run anytime. It will lock the table for a moment, but that moment doesn't grow with the size of the table.
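For reference, the statements discussed above look like this. The table name orders is a placeholder used only for illustration.

```sql
-- Placeholder table name: orders.

-- Cheap: samples pages and refreshes index statistics for the optimizer.
ANALYZE TABLE orders;

-- Expensive: rebuilds the whole table. On InnoDB this reports
-- "Table does not support optimize, doing recreate + analyze instead".
OPTIMIZE TABLE orders;

-- The rebuild that OPTIMIZE TABLE is mapped to for InnoDB, per the quoted docs.
ALTER TABLE orders FORCE;
```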
Don't bother. InnoDB almost never needs either ANALYZE or OPTIMIZE; don't waste your time unless you have identified a need.
An exception is a FULLTEXT index on an InnoDB table. Such can benefit from DROP INDEX, then ADD INDEX.
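A sketch of the FULLTEXT rebuild, assuming a hypothetical table articles with a FULLTEXT index named ft_body on a column body (all three names are invented for the example):

```sql
-- Hypothetical names: articles, ft_body, body.
-- Full-text searches on this column cannot use the index
-- between the two statements.
ALTER TABLE articles DROP INDEX ft_body;
ALTER TABLE articles ADD FULLTEXT INDEX ft_body (body);
```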
If you are "reloading" the table from new data, then the following avoids downtime:
CREATE TABLE new LIKE real;
-- load the fresh data into `new` here
RENAME TABLE real TO old, new TO real; -- fast, atomic
DROP TABLE old;
(Caveat: The above technique probably has issues if there are FOREIGN KEYS.)

How to remove large data from MySQL?

I have 50 GB of data in a table, and I have to remove records older than a particular date after backing them up.
Currently I follow these steps:
Take a backup of the complete table.
Run a DELETE query with a WHERE clause to remove the unneeded data:
DELETE FROM <some-table-name> WHERE `creation_time` <= '<some-valid-time>'
Problems with the current approach:
It is painfully slow.
Storage is redundant: the backup covers the whole table even though only selected records are removed, so only an incremental backup is really required.
After deletion, the disk space is not returned to the OS (until the table is optimized).
I thought of breaking the table into smaller weekly or monthly tables, which would make backup and deletion easy, but querying them together would be difficult and slow.
Please advise on a smart and efficient way to do this.
Use creation_time as the partitioning key and make per-week or per-month partitions. Dropping old partitions is incredibly fast.
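A minimal sketch of monthly partitioning on creation_time. The table name, the other columns, and the partition names are assumptions made for the example. Note that MySQL requires the partitioning column to be part of every unique key, which is why creation_time is included in the primary key here.

```sql
-- Hypothetical table; only creation_time comes from the question.
CREATE TABLE log_partitioned (
  id            BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  creation_time DATETIME NOT NULL,
  message       VARCHAR(255),
  PRIMARY KEY (id, creation_time)
)
PARTITION BY RANGE (TO_DAYS(creation_time)) (
  PARTITION p2016_11 VALUES LESS THAN (TO_DAYS('2016-12-01')),
  PARTITION p2016_12 VALUES LESS THAN (TO_DAYS('2017-01-01')),
  PARTITION p2017_01 VALUES LESS THAN (TO_DAYS('2017-02-01')),
  PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- Removing a month of data drops the partition instead of deleting row by row.
ALTER TABLE log_partitioned DROP PARTITION p2016_11;

-- Periodically split the catch-all partition to make room for the next month.
ALTER TABLE log_partitioned REORGANIZE PARTITION p_future INTO (
  PARTITION p2017_02 VALUES LESS THAN (TO_DAYS('2017-03-01')),
  PARTITION p_future VALUES LESS THAN MAXVALUE
);
```

Queries still address a single table, which avoids the problem of querying many weekly or monthly tables together.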

Inserting New Column in MYSQL taking too long

We have a huge database and inserting a new column is taking too long. Any way to speed things up?
Unfortunately, there's probably not much you can do. When inserting a new column, MySQL makes a copy of the table and inserts the new data there. You may find it faster to do
CREATE TABLE new_table LIKE old_table;
ALTER TABLE new_table ADD COLUMN (column definition);
INSERT INTO new_table(old columns) SELECT * FROM old_table;
RENAME TABLE old_table TO tmp, new_table TO old_table;
DROP TABLE tmp;
This hasn't been my experience, but I've heard others have had success. You could also try disabling indices on new_table before the insert and re-enabling later. Note that in this case, you need to be careful not to lose any data which may be inserted into old_table during the transition.
Alternatively, if your concern is impacting users during the change, check out pt-online-schema-change which makes clever use of triggers to execute ALTER TABLE statements while keeping the table being modified available. (Note that this won't speed up the process however.)
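The index-disabling idea mentioned above can be sketched as follows. This assumes a MyISAM table, since DISABLE KEYS only affects non-unique MyISAM indexes and is not effective for InnoDB. The column names (id, col_a, col_b, extra_col) are placeholders.

```sql
-- Placeholder columns: id, col_a, col_b (existing) and extra_col (new).
CREATE TABLE new_table LIKE old_table;
ALTER TABLE new_table ADD COLUMN extra_col INT NULL;

-- Stop maintaining non-unique indexes during the bulk copy (MyISAM only).
ALTER TABLE new_table DISABLE KEYS;

INSERT INTO new_table (id, col_a, col_b)
SELECT id, col_a, col_b FROM old_table;

-- Rebuild the non-unique indexes in one pass.
ALTER TABLE new_table ENABLE KEYS;

RENAME TABLE old_table TO tmp, new_table TO old_table;
DROP TABLE tmp; -- after verifying the copy
```

As noted above, rows written to old_table while the copy runs are not carried over, so writes need to be paused or reconciled separately.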
There are four main things that you can do to make this faster:
If using innodb_file_per_table the original table may be highly fragmented in the filesystem, so you can try defragmenting it first.
Make the buffer pool as big as sensible, so more of the data, particularly the secondary indexes, fits in it.
Make innodb_io_capacity high enough, perhaps higher than usual, so that insert buffer merging and flushing of modified pages will happen more quickly. Requires MySQL 5.1 with InnoDB plugin or 5.5 and later.
MySQL 5.1 with InnoDB plugin and MySQL 5.5 and later support fast alter table. One of the things that it makes a lot faster is adding or rebuilding indexes that are both non-unique and not in a foreign key. So you can do this:
A. ALTER TABLE ADD your column, DROP your non-unique indexes that aren't in FKs.
B. ALTER TABLE ADD back your non-unique, non-FK indexes.
This should provide these benefits:
a. Less use of the buffer pool during step A because the buffer pool will only need to hold some of the indexes, the ones that are unique or in FKs. Indexes are randomly updated during this step so performance becomes much worse if they don't fully fit in the buffer pool. So more chance of your rebuild staying fast.
b. The fast alter table rebuilds the index by sorting the entries then building the index. This is faster and also produces an index with a higher page fill factor, so it'll be smaller and faster to start with.
The main disadvantage is that this is in two steps and after the first one you won't have some indexes that may be required for good performance. If that is a problem you can try the copy to a new table approach, using just the unique and FK indexes at first for the new table, then adding the non-unique ones later.
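Steps A and B above might look like this. The table, column, and index names are invented for the example.

```sql
-- Hypothetical names: big_table, new_col, idx_status, status.

-- Step A: add the column and drop the non-unique, non-FK indexes
-- in the same statement, so the table is rebuilt only once.
ALTER TABLE big_table
  ADD COLUMN new_col INT NULL,
  DROP INDEX idx_status;

-- Step B: add the non-unique indexes back; fast alter table builds
-- them by sorting rather than by random inserts.
ALTER TABLE big_table
  ADD INDEX idx_status (status);
```

Queries that rely on idx_status will be slow between the two statements, which is the disadvantage described above.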
It's only in MySQL 5.6, but the feature request in http://bugs.mysql.com/bug.php?id=59214 increases the speed with which insert buffer changes are flushed to disk and limits how much space the insert buffer can take in the buffer pool. This can be a performance limit for big jobs. The insert buffer is used to cache changes to secondary index pages.
We know that this is still frustratingly slow sometimes and that a true online alter table is very highly desirable.
This is my personal opinion. For an official Oracle view, contact an Oracle public relations person.
James Day, MySQL Senior Principal Support Engineer, Oracle
Usually a slow column insert means that there are many indexes, so I would suggest reconsidering your indexing.
Michael's solution may speed things up a bit, but perhaps you should have a look at the database and try to break the big table into smaller ones. Take a look at this: link. Normalizing your database tables may save you loads of time in the future.

Generating a massive 150M-row MySQL table

I have a C program that mines a huge data source (20GB of raw text) and generates loads of INSERTs to execute on a simple blank table (4 integer columns with 1 primary key). Set up as a MEMORY table, the entire task completes in 8 hours. After finishing, about 150 million rows exist in the table. Eight hours is a completely decent number for me. This is a one-time deal.
The problem comes when trying to convert the MEMORY table back into MyISAM so that (A) I'll have the memory freed up for other processes and (B) the data won't be killed when I restart the computer.
ALTER TABLE memtable ENGINE = MyISAM
I've let this ALTER TABLE query run for over two days now, and it's not done. I've now killed it.
If I create the table initially as MyISAM, the write speed seems terribly poor (especially due to the fact that the query requires the use of the ON DUPLICATE KEY UPDATE technique). I can't temporarily turn off the keys. The table would become over 1000 times larger if I were to and then I'd have to reprocess the keys and essentially run a GROUP BY on 150,000,000,000 rows. Umm, no.
One of the key constraints to realize: The INSERT query UPDATEs records if the primary key (a hash) exists in the table already.
At the very beginning of an attempt at strictly using MyISAM, I'm getting a rough speed of 1,250 rows per second. Once the index grows, I imagine this rate will tank even more.
I have 16GB of memory installed in the machine. What's the best way to generate a massive table that ultimately ends up as an on-disk, indexed MyISAM table?
Clarification: There are many, many UPDATEs going on from the query (INSERT ... ON DUPLICATE KEY UPDATE val=val+whatever). This isn't, by any means, a raw dump problem. My reasoning for trying a MEMORY table in the first place was for speeding-up all the index lookups and table-changes that occur for every INSERT.
If you intend to make it a MyISAM table, why are you creating it in memory in the first place? If it's only for speed, I think the conversion to a MyISAM table is going to negate any speed improvement you get by creating it in memory to start with.
You say inserting directly into an "on disk" table is too slow (though I'm not sure how you're deciding that when your current method is taking days). You may be able to turn off or remove the uniqueness constraints, use a DELETE query later to re-establish uniqueness, and then re-enable or re-add the constraints. I have used this technique when importing into an InnoDB table in the past, and found that even with the later delete it was much faster overall.
Another option might be to create a CSV file instead of the INSERT statements, and either load it into the table using LOAD DATA INFILE (I believe that is faster than the inserts, but I can't find a reference at present) or use it directly via the CSV storage engine, depending on your needs.
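A sketch of the LOAD DATA INFILE route, assuming the C program writes comma-separated rows to a file. The file path, table name, and column names are all placeholders.

```sql
-- Placeholder path, table, and columns.
-- The file must be readable by the MySQL server process.
LOAD DATA INFILE '/tmp/rows.csv'
INTO TABLE staging
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(hash_key, col_a, col_b, val);
```

A plain load like this does not perform the val=val+whatever update on duplicate keys, so the duplicates still have to be merged in a later step.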
Sorry to keep throwing comments at you (last one, probably).
I just found this article, which provides an example of converting a large table from MyISAM to InnoDB. While this isn't what you are doing, the author uses an intermediate MEMORY table and describes going from memory to InnoDB in an efficient way: ordering the table in memory the way that InnoDB expects it to be ordered in the end. If you aren't tied to MyISAM, it might be worth a look, since you already have a "correct" memory table built.
I don't use MySQL but use SQL Server, and this is the process I use to handle a file of similar size. First I dump the file into a staging table that has no constraints. Then I identify and delete the dups from the staging table. Then I search for existing records that might match and put the id field into a column in the staging table. Then I update where the id field column is not null and insert where it is null. One of the reasons I do all the work of getting rid of the dups in the staging table is that it means less impact on the prod table when I run it, and thus it is faster in the end. My whole process runs in less than an hour (and actually does much more than I describe, as I also have to denormalize and clean the data) and affects production tables for less than 15 minutes of that time. I don't have to worry about adjusting any constraints or dropping indexes or any of that, since I do most of my processing before I hit the prod table.
Consider whether a similar process might work better for you. Also, could you use some sort of bulk import to get the raw data into the staging table (I pull the 22 gig file I have into staging in around 16 minutes) instead of working row by row?
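Translated to MySQL, the staging idea could be sketched roughly as below. This is an assumption-laden illustration, not the poster's SQL Server process: the table names and the two-column layout are invented, and because the question says the raw data is far larger than the final table, it would likely have to be run once per batch of raw data rather than once overall.

```sql
-- Hypothetical names: staging, final_table, hash_key, val.
-- Staging table without keys, so bulk loading is cheap.
CREATE TABLE staging (
  hash_key BIGINT UNSIGNED NOT NULL,
  val      INT NOT NULL
) ENGINE = MyISAM;

-- ... bulk load one batch of raw rows into staging here ...

-- Collapse duplicates inside the batch first, then merge into the keyed table:
-- new keys are inserted, existing keys have the batch total added to val.
INSERT INTO final_table (hash_key, val)
SELECT hash_key, SUM(val)
FROM staging
GROUP BY hash_key
ON DUPLICATE KEY UPDATE val = final_table.val + VALUES(val);

-- Empty the staging table before the next batch.
TRUNCATE TABLE staging;
```

The point of the design is that the indexed table is touched once per distinct key per batch instead of once per raw row.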

Oracle performance of schema changes as compared to MySQL ALTER TABLE?

When using MySQL MyISAM tables, and issuing an ALTER TABLE statement to add a column, MySQL creates a temporary table and copies all the data into the new table before overwriting the original table.
If that table has a lot of data, this process can be very slow (especially when rebuilding indexes), and requires you to have enough free space on the disk to store 2 copies of the table. This is very annoying.
How does Oracle work when adding columns? Is it fast on large tables?
I'm always interested in being able to do schema changes without having a lot of downtime. We are always adding new features to our software which require schema changes with every release. Any advice is appreciated...
Adding a column with no data to a large table in Oracle is generally very fast. There is no temporary copy of the data and no need to rebuild indexes. Slowness will generally arise when you want to add a column to a large table and backfill data into that new column for all the existing rows, since now you're talking about an UPDATE statement that affects a large number of rows.
Adding columns can lead to row migration over time. If you have a block that is 80% full with 4 rows and you add columns that will grow the size of each row 30% over time, you'll eventually reach a point where Oracle has to move one of the 4 rows to a different block. It does this by leaving a pointer to the new block in the old block, which causes reads on that migrated row to require more I/O. Eliminating migrated rows can be somewhat costly, and though it is generally possible to do without downtime assuming you're using the enterprise edition, it is generally easier if you have a bit of downtime. But row migration is something that you generally only have to worry about well down the road. If you know that certain tables are likely to have their row size increase substantially in the future, you can mitigate problems in advance by specifying a larger PCTFREE setting for the table.
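In Oracle syntax, the two operations described above might look like this. The table and column names are placeholders, and the PCTFREE value of 30 is only an example figure.

```sql
-- Placeholder names: big_table, new_col.

-- Adding a nullable column with no data.
ALTER TABLE big_table ADD (new_col NUMBER);

-- Reserve more free space per block for future row growth,
-- to reduce row migration later. This influences how blocks
-- are filled from now on.
ALTER TABLE big_table PCTFREE 30;
```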
With regard to downtime, altering a table (like a bunch of other DDL operations) takes an exclusive lock. However, Oracle can also perform online redefinition of objects using the DBMS_REDEFINITION package, which can really take a bite out of downtime.