I want to add 8 new columns to a large MySQL (version 5.6) InnoDB table that has millions of records. I am trying to do this in the most optimised way.
Is there any advantage to adding all the columns in a single query over adding the 8 columns in 8 separate queries? If so, I would like to know why.
When specifying ALGORITHM=INPLACE, LOCK=NONE, what do I need to take care of so that it won't cause any data corruption or application failure?
I was testing out ALGORITHM=INPLACE, LOCK=NONE with this query:
ALTER TABLE table_test ADD COLUMN test_column TINYINT UNSIGNED DEFAULT 0, ALGORITHM=INPLACE, LOCK=NONE;
But it's taking the same time as it did with ALGORITHM=DEFAULT. What could be the reason?
The table I'm altering has only the primary key index and no other indexes. The queries coming to this table from the application are:
insert into table;
select * from table where user_id=uid;
select sum(column) from table where user_id=id and date<NOW();
By "optimized", do you mean "fastest"? Or "least impact on other queries"?
In older versions, the optimal way (using no add-ons) was to put all the ADD COLUMNs in a single ALTER TABLE; then wait until it finishes.
In any version, pt-online-schema-change will add all the columns with only a brief downtime.
Since you mention ALGORITHM=INPLACE, LOCK=NONE, I assume you are using a newer version? So, it may be that 8 ALTERs is optimal. There would be some interference, but perhaps not "too much".
ALGORITHM=DEFAULT lets the server pick the "best". This is almost always really the "best". That is, there is rarely a need to say anything other than DEFAULT.
You can never get data corruption. At worst, a query may fail due to some kind of timeout caused by the interference of the ALTER(s). You should always be checking for errors (including timeouts) and handle them in your app.
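As a hedged illustration (reusing the names from the question), one common precaution is to give the ALTER session a short metadata-lock timeout so it fails fast instead of queuing behind long transactions and blocking the application's queries:
-- Sketch only: lock_wait_timeout is the metadata-lock timeout, in seconds.
SET SESSION lock_wait_timeout = 10;
ALTER TABLE table_test
  ADD COLUMN test_column TINYINT UNSIGNED DEFAULT 0,
  ALGORITHM=INPLACE, LOCK=NONE;
-- If the ALTER cannot get its metadata lock within 10 seconds it errors out,
-- and you can simply retry it later.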
To discuss the queries...
insert into table;
One row at a time? Or batched? (Batched is more efficient -- perhaps 10x better.)
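For illustration only (the column names here are assumptions, not taken from the question), a batched insert looks like:
-- One round trip and one transaction for several rows:
INSERT INTO table_test (user_id, amount, created_at) VALUES
  (1, 10.00, NOW()),
  (2, 20.00, NOW()),
  (3, 30.00, NOW());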
select * from table;
Surely not! That would give you all the columns for millions of rows. Why should you ever do that?
select count(column) from table where pk=id and date<NOW();
COUNT(col) checks col for being NOT NULL -- Do you need that? If not, then simply do COUNT(*).
WHERE pk=id gives you only one row; so why also qualify with date<NOW()? The PRIMARY KEY makes the query as fast as possible.
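For illustration (table and column names assumed), the simplified form of that query would be:
SELECT COUNT(*)          -- no per-column NOT NULL check
FROM table_test
WHERE user_id = 12345;   -- the PRIMARY KEY lookup already returns at most one row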
The only index is PRIMARY KEY? This seems unusual for a million-row table. Is it a "Fact" table in a "Data Warehouse" app?
Internals
(Caveat: Much of this discussion of Internals is derived indirectly, and could be incorrect.)
For some ALTERs, the work is essentially just in the schema. Eg: Adding options on the end of an ENUM; increasing the size of a VARCHAR.
For some ALTERs with INPLACE, the processing is essentially modifying the data in place -- without having to copy it. Eg: Adding a column at the end.
PRIMARY KEY changes (in InnoDB) necessarily involve rebuilding the BTree containing the data; they cannot be done INPLACE.
Many secondary INDEX operations can be done without touching (other than reading) the data. DROP INDEX throws away a BTree and makes some meta changes. ADD INDEX reads the entire table, building the index BTree on the side, then announcing its existence. CHARACTER SET and COLLATION changes require rebuilding an index.
If the table must be copied over, there is a significant lock on the table. Any ALTER that needs to read all the data has an indirect impact because of the I/O and/or CPU and/or brief locks on blocks/rows/etc.
It is unclear whether the code is smart enough to handle a multi-task ALTER in the most efficient way. Adding 8 columns in one INPLACE pass should be possible, but if it made the code too complex, that operation may be converted to COPY.
Probably a multi-task ALTER will do the 'worst' case. For example, changing the PRIMARY KEY and augmenting an ENUM will simply do both in a single COPY. Since COPY is the original way of doing all ALTERs, it is well debugged and optimized by now. (But it is slow and invasive.)
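Back on the original question, a hedged sketch of combining all eight additions into one statement (column names and types are placeholders):
ALTER TABLE table_test
  ADD COLUMN col1 TINYINT UNSIGNED DEFAULT 0,
  ADD COLUMN col2 TINYINT UNSIGNED DEFAULT 0,
  -- ... the remaining six ADD COLUMNs ...
  ADD COLUMN col8 TINYINT UNSIGNED DEFAULT 0,
  ALGORITHM=INPLACE, LOCK=NONE;
-- Because the algorithm is stated explicitly, 5.6 raises an error instead of
-- silently falling back to COPY if it cannot do all of this in place.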
COPY is really quite simple to implement, mostly involving existing primitives:
Lock real so no one is writing to it
CREATE TABLE new LIKE real;
ALTER TABLE new ... -- whatever you asked for
copy all the rows from real to new -- this is the slow part
RENAME TABLE real TO old, new TO real; -- fast, atomic, etc.
Unlock
DROP TABLE old;
INPLACE is more complex because it must decide among many different algorithms and locking levels. DEFAULT has to punt off to COPY if it cannot do INPLACE.
Related
I have a log table that is currently 10GB. It has a lot of data for the past 2 years, and I really feel at this point I don't need so much in there. Am I wrong to assume it is not good to have years of data in a table (a smaller table is better)?
My tables all use the MyISAM engine.
I would like to delete all data from 2014 and 2015, and soon I'll do 2016, but I'm concerned about what exactly will happen after I run the DELETE statement. I understand that because it's MyISAM, a lock will occur during which no writing can take place? I would probably delete data by the month, and do it late at night, to minimize the impact since it's a production DB.
My prime interest, specifically, is this: should I take some sort of action after this deletion? Do I need to manually tell MySQL to do anything to my table, or will MySQL do all the housekeeping itself: reclaiming space, reindexing, and ultimately optimizing my table after the 400,000k records I'll be deleting?
Thanks everyone!
Plan A: Use a time-series PARTITIONing of the table so that future deletions are 'instantaneous' because of DROP PARTITION. More discussion here. Partitioning only works if you will be deleting all rows older than X.
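A hedged sketch of Plan A, assuming the log table has a created_at DATETIME column (the one-time repartition does copy the table once, and the partitioning column must be part of every unique key):
ALTER TABLE log_table
  PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p2014 VALUES LESS THAN (TO_DAYS('2015-01-01')),
    PARTITION p2015 VALUES LESS THAN (TO_DAYS('2016-01-01')),
    PARTITION p2016 VALUES LESS THAN (TO_DAYS('2017-01-01')),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
  );
-- Later, dropping a whole year is nearly instantaneous:
ALTER TABLE log_table DROP PARTITION p2014;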
Plan B: To avoid lengthy locking, chunk the deletes. See here. This is optionally followed by an OPTIMIZE TABLE to reclaim space.
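A hedged sketch of Plan B (table and column names assumed); repeat the DELETE until it reports 0 rows affected:
-- Each pass removes a small batch, keeping the table lock short.
DELETE FROM log_table
WHERE created_at < '2016-01-01'
LIMIT 10000;
-- Optionally, once all the deletes are done:
OPTIMIZE TABLE log_table;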
Plan C: Simply copy over what you want to keep, then abandon the rest. This is especially good if you need to preserve only a small proportion of the table.
CREATE TABLE new LIKE real;
INSERT INTO new
SELECT * FROM real
WHERE ... ; -- just the newer rows;
RENAME TABLE real TO old, new TO real; -- instantaneous and atomic
DROP TABLE old; -- after verifying that all went well.
Note: The .MYD file contains the data; it will not shrink on its own. Deletes will leave holes in it. Further inserts (and updates) will use the holes in preference to growing the table. Plans A and C (but not B) will avoid the holes, and truly free up space.
Tim and e4c5 have given some good recommendations and I urge them to add their answers.
You can run OPTIMIZE TABLE after doing the deletes. OPTIMIZE TABLE will help you with a few things (taken from the docs):
If the table has deleted or split rows, repair the table.
If the index pages are not sorted, sort them.
If the table's statistics are not up to date (and the repair could not be accomplished by sorting the index), update them.
According to the docs: http://dev.mysql.com/doc/refman/5.7/en/optimize-table.html
Use OPTIMIZE TABLE in these cases, depending on the type of table:
...
After deleting a large part of a MyISAM or ARCHIVE table, or making
many changes to a MyISAM or ARCHIVE table with variable-length rows
(tables that have VARCHAR, VARBINARY, BLOB, or TEXT columns). Deleted
rows are maintained in a linked list and subsequent INSERT operations
reuse old row positions. You can use OPTIMIZE TABLE to reclaim the
unused space and to defragment the data file. After extensive changes
to a table, this statement may also improve performance of statements
that use the table, sometimes significantly.
This is the query I'm trying to execute:
DROP INDEX id_index on table;
I've been able to drop indexes quickly in the past, but this query ran for almost an hour before I gave up on it. What could be causing this slow pace?
SHOW CREATE TABLE -- If it says ENGINE=MyISAM, the DROP is performed this way:
Copy the entire table data over into a temp (slow due to lots of I/O)
Rebuild all the remaining indexes (very slow, in some cases)
Rename to replace the existing table (always fast)
This can be very slow, depending on the size of the table. This is because of all the disk I/O.
If it says ENGINE=InnoDB, things could be better. But it still matters whether you are DROPping the PRIMARY KEY or not. And possibly whether the KEY is involved in a FOREIGN KEY constraint. I assume old_alter_table is set to OFF.
http://dev.mysql.com/doc/refman/5.6/en/alter-table.html
has a lot of details. What you needed to say was ALGORITHM=INPLACE. You probably got ALGORITHM=DEFAULT, and I don't see in the doc what the default is.
ALGORITHM=COPY acts like I mentioned above for MyISAM.
ALGORITHM=INPLACE should take very little time, regardless of the table/index size.
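For InnoDB, a hedged example of requesting the in-place path explicitly, reusing the index name from the question:
-- With an explicit ALGORITHM, the server raises an error instead of
-- silently falling back to a full table copy if it cannot comply.
ALTER TABLE `table` DROP INDEX id_index, ALGORITHM=INPLACE, LOCK=NONE;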
(Be sure to check the details of ALTER for whichever version you are running. There have been several significant changes in recent major versions.)
We have a huge database and inserting a new column is taking too long. Any way to speed things up?
Unfortunately, there's probably not much you can do. When inserting a new column, MySQL makes a copy of the table and inserts the new data there. You may find it faster to do:
CREATE TABLE new_table LIKE old_table;
ALTER TABLE new_table ADD COLUMN (column definition);
INSERT INTO new_table(old columns) SELECT * FROM old_table;
RENAME TABLE old_table TO tmp, new_table TO old_table;
DROP TABLE tmp;
This hasn't been my experience, but I've heard others have had success. You could also try disabling indices on new_table before the insert and re-enabling later. Note that in this case, you need to be careful not to lose any data which may be inserted into old_table during the transition.
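If new_table is MyISAM, the index-disabling step mentioned above is roughly this (a sketch; DISABLE KEYS only affects non-unique indexes):
ALTER TABLE new_table DISABLE KEYS;   -- skip non-unique index maintenance during the bulk load
-- ... run the INSERT INTO new_table ... SELECT ... from above ...
ALTER TABLE new_table ENABLE KEYS;    -- rebuild the disabled indexes in one pass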
Alternatively, if your concern is impacting users during the change, check out pt-online-schema-change which makes clever use of triggers to execute ALTER TABLE statements while keeping the table being modified available. (Note that this won't speed up the process however.)
There are four main things that you can do to make this faster:
If using innodb_file_per_table the original table may be highly fragmented in the filesystem, so you can try defragmenting it first.
Make the buffer pool as big as sensible, so more of the data, particularly the secondary indexes, fits in it.
Make innodb_io_capacity high enough, perhaps higher than usual, so that insert buffer merging and flushing of modified pages will happen more quickly. Requires MySQL 5.1 with InnoDB plugin or 5.5 and later.
MySQL 5.1 with the InnoDB plugin and MySQL 5.5 and later support fast alter table. One of the things it makes a lot faster is adding or rebuilding indexes that are both not unique and not in a foreign key. So you can do this (a sketch follows after this list):
A. ALTER TABLE ADD your column, DROP your non-unique indexes that aren't in FKs.
B. ALTER TABLE ADD back your non-unique, non-FK indexes.
This should provide these benefits:
a. Less use of the buffer pool during step A because the buffer pool will only need to hold some of the indexes, the ones that are unique or in FKs. Indexes are randomly updated during this step so performance becomes much worse if they don't fully fit in the buffer pool. So more chance of your rebuild staying fast.
b. The fast alter table rebuilds the index by sorting the entries then building the index. This is faster and also produces an index with a higher page fill factor, so it'll be smaller and faster to start with.
The main disadvantage is that this is in two steps and after the first one you won't have some indexes that may be required for good performance. If that is a problem you can try the copy to a new table approach, using just the unique and FK indexes at first for the new table, then adding the non-unique ones later.
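A hedged sketch of steps A and B (the table, column, and index names here are assumptions, not from the question):
-- Step A: add the column and drop the non-unique, non-FK secondary indexes in one pass.
ALTER TABLE big_table
  ADD COLUMN new_col INT NULL,
  DROP INDEX idx_created_at,
  DROP INDEX idx_status;

-- Step B: add those indexes back; fast alter table builds them by sorting,
-- which is quicker and gives a better page fill factor.
ALTER TABLE big_table
  ADD INDEX idx_created_at (created_at),
  ADD INDEX idx_status (status);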
It's only in MySQL 5.6, but the feature request in http://bugs.mysql.com/bug.php?id=59214 increases the speed with which insert buffer changes are flushed to disk and limits how much space the insert buffer can take in the buffer pool. This can be a performance limit for big jobs. The insert buffer is used to cache changes to secondary index pages.
We know that this is still frustratingly slow sometimes and that a true online alter table is very highly desirable.
This is my personal opinion. For an official Oracle view, contact an Oracle public relations person.
James Day, MySQL Senior Principal Support Engineer, Oracle
Usually a slow column addition like this means that there are many indexes, so I would suggest reconsidering your indexing.
Michael's solution may speed things up a bit, but perhaps you should have a look at the database and try to break the big table into smaller ones. Take a look at this: link. Normalizing your database tables may save you loads of time in the future.
I have a table with about 200,000 records. I want to add a field to it:
ALTER TABLE `table` ADD `param_21` BOOL NOT NULL COMMENT 'about the field' AFTER `param_20`
But it seems to be a very heavy query and it takes a very long time, even on my quad-core AMD PC with 4GB of RAM.
I am running under Windows/XAMPP and phpMyAdmin.
Does MySQL have to touch every record when adding a field?
Or can I change the query so it makes the change more quickly?
MySQL will, in almost all cases, rebuild the table during an ALTER**. This is because the row-based engines (i.e. all of them) HAVE to do this to retain the data in the right format for querying. It's also because there are many other changes you could make which would also require rebuilding the table (such as changing indexes, primary keys, etc.).
I don't know what engine you're using, but I will assume MyISAM. MyISAM copies the data file, making any necessary format changes - this is relatively quick and is not likely to take much longer than the I/O hardware needs to read the old data file in and write the new one out to disc.
Rebuilding the indexes is really the killer. Depending on how you have it configured, MySQL will, for each index, put the indexed columns into a filesort buffer (which may be in memory but is typically on disc), sort that using its filesort() function (which does a quicksort by recursively copying the data between two files if it's too big for memory), and then build the entire index based on the sorted data.
If it can't do the filesort trick, it will just behave as if you did an INSERT on every row, and populate the index blocks with each row's data in turn. This is painfully slow and results in far from optimal indexes.
You can tell which it's doing by using SHOW PROCESSLIST during the process. "Repairing by filesort" is good, "Repairing with keycache" is bad.
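From another connection while the ALTER is running:
SHOW PROCESSLIST;   -- the State column of the ALTER's thread shows which repair method is in use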
All of this will use AT MOST one core, but will sometimes be IO bound as well (especially copying the data file).
** There are some exceptions, such as dropping secondary indexes on innodb plugin tables.
If you add a NOT NULL column, the tuples need to be populated. So it will be slow...
This touches each of the 200,000 records, as each record needs to be updated with a new bool value which is not going to be null.
So, yes, it's an expensive query... There is nothing you can do to make it faster.
I have a C program that mines a huge data source (20GB of raw text) and generates loads of INSERTs to execute on a simple blank table (4 integer columns with 1 primary key). Set up as a MEMORY table, the entire task completes in 8 hours. After finishing, about 150 million rows exist in the table. Eight hours is a completely decent number for me. This is a one-time deal.
The problem comes when trying to convert the MEMORY table back into MyISAM so that (A) I'll have the memory freed up for other processes and (B) the data won't be killed when I restart the computer.
ALTER TABLE memtable ENGINE = MyISAM
I've let this ALTER TABLE query run for over two days now, and it's not done. I've now killed it.
If I create the table initially as MyISAM, the write speed seems terribly poor (especially due to the fact that the query requires the use of the ON DUPLICATE KEY UPDATE technique). I can't temporarily turn off the keys. The table would become over 1000 times larger if I were to and then I'd have to reprocess the keys and essentially run a GROUP BY on 150,000,000,000 rows. Umm, no.
One of the key constraints to realize: The INSERT query UPDATEs records if the primary key (a hash) exists in the table already.
At the very beginning of an attempt at strictly using MyISAM, I'm getting a rough speed of 1,250 rows per second. Once the index grows, I imagine this rate will tank even more.
I have 16GB of memory installed in the machine. What's the best way to generate a massive table that ultimately ends up as an on-disk, indexed MyISAM table?
Clarification: There are many, many UPDATEs going on from the query (INSERT ... ON DUPLICATE KEY UPDATE val=val+whatever). This isn't, by any means, a raw dump problem. My reasoning for trying a MEMORY table in the first place was for speeding-up all the index lookups and table-changes that occur for every INSERT.
If you intend to make it a MyISAM table, why are you creating it in memory in the first place? If it's only for speed, I think the conversion to a MyISAM table is going to negate any speed improvement you get by creating it in memory to start with.
You say inserting directly into an "on disk" table is too slow (though I'm not sure how you're deciding that, when your current method is taking days). You may be able to turn off or remove the uniqueness constraints, use a DELETE query later to re-establish uniqueness, and then re-enable/add the constraints. I have used this technique when importing into an InnoDB table in the past, and found that even with the later delete it was much faster overall.
Another option might be to create a CSV file instead of the INSERT statements, and either load it into the table using LOAD DATA INFILE (I believe that is faster than the inserts, but I can't find a reference at present) or use it directly via the CSV storage engine, depending on your needs.
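A hedged sketch of the LOAD DATA route (the file path and column names are assumptions; note that LOAD DATA offers REPLACE and IGNORE handling of duplicate keys, not the ON DUPLICATE KEY UPDATE arithmetic from the question):
LOAD DATA INFILE '/tmp/mined_rows.csv'
INTO TABLE target_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(hash_key, col_a, col_b, col_c);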
Sorry to keep throwing comments at you (last one, probably).
I just found this article, which provides an example of converting a large table from MyISAM to InnoDB. While this isn't what you are doing, he uses an intermediate MEMORY table and describes going from memory to InnoDB in an efficient way: ordering the table in memory the way that InnoDB expects it to be ordered in the end. If you aren't tied to MyISAM it might be worth a look, since you already have a "correct" memory table built.
I don't use MySQL, but I use SQL Server, and this is the process I use to handle a file of similar size. First I dump the file into a staging table that has no constraints. Then I identify and delete the dups from the staging table. Then I search for existing records that might match and put the id field into a column in the staging table. Then I update where the id field column is not null and insert where it is null. One of the reasons I do all the work of getting rid of the dups in the staging table is that it means less impact on the prod table when I run it, and thus it is faster in the end. My whole process runs in less than an hour (and actually does much more than I describe, as I also have to denormalize and clean the data) and affects production tables for less than 15 minutes of that time. I don't have to worry about adjusting any constraints or dropping indexes or any of that, since I do most of my processing before I hit the prod table.
Consider whether a similar process might work better for you. Also, could you use some sort of bulk import to get the raw data into the staging table (I pull the 22-gig file I have into staging in around 16 minutes) instead of working row by row?
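A hedged adaptation of that staging workflow to MySQL (all table, column, and file names are assumptions); it preserves the question's val = val + ... accumulation by collapsing duplicates with GROUP BY before touching the production table:
-- 1. Bulk-load the raw rows into an unindexed MyISAM staging table
--    (e.g. with LOAD DATA INFILE, as sketched earlier).
CREATE TABLE staging (hash_key BIGINT UNSIGNED NOT NULL, val INT NOT NULL) ENGINE=MyISAM;

-- 2. Collapse the duplicates, then apply everything to the production table in one pass;
--    the derived table keeps the column references in the UPDATE clause unambiguous.
INSERT INTO prod_table (hash_key, val)
SELECT hash_key, sum_val
FROM (SELECT hash_key, SUM(val) AS sum_val FROM staging GROUP BY hash_key) AS dt
ON DUPLICATE KEY UPDATE val = val + sum_val;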