Converting a big MyISAM table to InnoDB - mysql

I'm trying to convert a 10-million-row MySQL MyISAM table into InnoDB.
I tried ALTER TABLE, but that made my server hang, so I killed the mysql process manually. What is the recommended way to do this?
Options I've thought about:
1. Making a new InnoDB table and inserting parts of the data at a time.
2. Dumping the table into a text file and then loading it with LOAD DATA INFILE.
3. Trying again and just keeping the server non-responsive until it finishes (I tried for 2 hours, and this is a production server, so I'd prefer to keep it running).
4. Duplicating the table, removing its indexes, converting it, and then adding the indexes back.

Changing the engine of the table requires a rewrite of the table, and that's why the table is unavailable for so long. Removing the indexes, then converting, then adding the indexes back may speed up the initial conversion, but adding an index places a read lock on the table, so the effect in the end is the same.

Making a new table and transferring the data is the way to go. Usually this is done in two parts: first copy the records, then replay any changes that were made while the records were being copied. If you can afford to disable inserts/updates on the table while leaving reads enabled, this is not a problem. If not, there are several possible solutions.

One of them is to use Facebook's online schema change tool. Another option is to have the application write to both tables while migrating the records, then switch over to the new table only. This depends on the application code, and the crucial part is handling unique keys / duplicates, since in the old table you may be updating a record while in the new one you need to insert it (the transaction isolation level may also play a crucial role here; lower it as much as you can).

The "classic" way is to use replication, which, as far as I know, is also done in two parts: you take a dump of the database while recording the master position, import the dump on the second server, and then start it as a slave to catch up with the changes.
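As a rough sketch of the copy-records-then-swap approach mentioned above (table and column names are made up for illustration, and an integer primary key to range over is assumed):

CREATE TABLE mytable_new LIKE mytable;
ALTER TABLE mytable_new ENGINE=InnoDB;
-- copy in primary-key ranges so each statement stays short
INSERT INTO mytable_new SELECT * FROM mytable WHERE id BETWEEN 1 AND 100000;
INSERT INTO mytable_new SELECT * FROM mytable WHERE id BETWEEN 100001 AND 200000;
-- ... continue until all ranges are copied, replay any changes made meanwhile,
-- then swap the tables atomically
RENAME TABLE mytable TO mytable_old, mytable_new TO mytable;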

Have you tried ordering your data by the primary key first? e.g.:
ALTER TABLE tablename ORDER BY PK_column;
That should speed up the conversion.

Related

Which is the better way to change the character set for huge data tables?

In my production database, the Alerts-related tables were created with a default charset of "latin"; because of this we get errors when we try
to insert Japanese characters into them. We need to change the default charset of the tables and columns to UTF-8.
As these tables hold a huge amount of data, the ALTER command might take a very long time (it took 5 hours in my local DB with the same amount of data)
and lock the tables, which would cause data loss. Can we plan a mechanism to change the charset to UTF-8 without data loss?
Which is the better way to change the charset for huge data tables?
I found this in the MySQL manual (http://dev.mysql.com/doc/refman/5.1/en/alter-table.html):
In most cases, ALTER TABLE makes a temporary copy of the original
table. MySQL waits for other operations that are modifying the table,
then proceeds. It incorporates the alteration into the copy, deletes
the original table, and renames the new one. While ALTER TABLE is
executing, the original table is readable by other sessions. Updates
and writes to the table that begin after the ALTER TABLE operation
begins are stalled until the new table is ready, then are
automatically redirected to the new table without any failed updates
So yes, it's tricky to minimize downtime while doing this. It depends on the usage profile of your table: are there more reads or writes?
One approach I can think of is to use some sort of replication. So create a new Alerts table that uses UTF-8, and find a way to replicate the original table to the new one without affecting availability / throughput. When the replication is complete (or close enough), switch the tables by renaming them?
Of course this is easier said than done; I'd need to read up on whether it's even possible.
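If renaming is the route taken, MySQL's RENAME TABLE can swap both tables in a single atomic statement, so there is no moment where the table is missing (the names below are illustrative):

RENAME TABLE alerts TO alerts_old_latin, alerts_utf8 TO alerts;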
You may take a look at the Percona Toolkit online-schema-change tool:
pt-online-schema-change
It does exactly this ("alters a table's structure without blocking reads or writes"), with some
limitations (only InnoDB tables, etc.) and risks involved.
Create a replicated copy of your database on another machine or instance. Once the replication is set up, issue STOP SLAVE and alter the table on the replica. If you have more than one table to convert, consider issuing START SLAVE between each conversion to resynchronise the two databases (if you don't, the final catch-up may take longer). When the conversion is complete, the replicated copy can replace your old production database and you can remove the old one. This is the way I found to minimise downtime.
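A rough sketch of that sequence on the replica (the table name is an assumption):

STOP SLAVE;
ALTER TABLE mytable ENGINE=InnoDB;  -- convert on the replica only
START SLAVE;                        -- let the replica catch up before the next table
SHOW SLAVE STATUS\G                 -- check Seconds_Behind_Master before continuing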

MySQL: time needed to add an index to a column in an InnoDB table of about 100,000 rows?

I'm curious how long it will take to put an index on a column in the DB that is not empty and has about 100,000 rows. I understand it depends, but does it take 0.1 seconds, 1 second, or 10 minutes? Should I stop all user actions during the indexing of a column in the DB?
The reason I ask is that I don't know which columns should be indexed (to get an answer, I'd have to measure it in practice once the table grows), because the table is almost empty and the project hasn't launched yet.
Thank you.
1) Create an SQL dump, load the data into another test database, and try it out.
2) Yes, you should prevent user actions during any sort of DB maintenance. Even if the update is fast, something could go wrong and you should give yourself scope for the unexpected. It's also a good practice to get into.
3) The best way to figure out which indexes are needed doesn't require much data:
a) Figure out the most frequently called and most data-intensive queries from a design point of view. Establish what they query on, and the index requirements should follow as a logical next step.
b) Learn how to read query plans, and look out for table scans, which you want to avoid.
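For example, here is a hypothetical query checked with EXPLAIN; a type of ALL in the output indicates a full table scan, which usually means a supporting index is missing (table and column names are invented):

EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- if the plan shows type: ALL, an index on the filtered column is a likely candidate
CREATE INDEX idx_orders_customer_id ON orders (customer_id);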
If you're not sure which fields to index, my first suggestion would be to read about indexes on the MySQL web site.
Secondly, I would suggest taking a dump of your data, loading it into a TEST environment that is completely separate from your live environment, and performing your tests on that machine.
You could even consider creating a slave instance for your live MySQL instance. You stop the slave instance, do your indexing on it, then wait for it to synchronize with the master again. Then you switch the master and slave instances so that your live environment doesn't get affected. Finally, you index the ex-master instance. But to do this, you have to know master-slave replication and how to convert a master into a slave and vice versa.

MySql ALTER TABLE on Production Databases - Any Issues?

I have about 100 databases (all with the same structure, just on different servers) with approximately a dozen tables each. Most tables are small (let's say 100MB or less). There are occasional edge cases where a table may be large (let's say 4GB+).
I need to run a series of ALTER TABLE commands on just about every table in each database. Mainly adding some columns to the structure, but also a few changes like changing a column from VARCHAR to TINYTEXT (or vice versa). Also adding a few new indexes (but indexing the new columns, not existing ones, so I'm assuming that isn't a big deal).
I am wondering how safe this is to do, and if there are any best practices to this process.
First, is there any chance I may corrupt or delete data in the tables? I suspect not, but I need to be certain.
Second, I presume that for the larger tables (4GB+) this may be a several-minutes to several-hours process?
Anything and everything I should know about performing ALTER TABLE commands on a production database I am interested in learning.
If it's of any value knowing, I am planning on issuing the commands via phpMyAdmin for the most part.
Thanks -
First off, before applying any changes, make backups. There are two ways you can do it: mysqldump everything, or copy your MySQL data folder.
Secondly, you may want to use mysql from the command line. phpMyAdmin will probably time out; most PHP servers have a timeout of less than 10 minutes. Or you might accidentally close the browser.
Here is my suggestion.
You can fail over the apps (make sure there are no connections to any of the DBs).
You can create the indexes using CREATE INDEX statements; don't use ALTER TABLE ... ADD INDEX statements.
Do all of this with a script (keep all the statements in a file and run it with SOURCE), as sketched below.
It looks like your table sizes are very small, so it won't create any headaches.
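A minimal illustration of that (the file name, table, and column are all made up):

-- contents of alter_script.sql
ALTER TABLE customers ADD COLUMN loyalty_level INT;
CREATE INDEX idx_customers_loyalty ON customers (loyalty_level);
-- then, from the mysql command-line client:
--   mysql> SOURCE /path/to/alter_script.sql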

Generating a massive 150M-row MySQL table

I have a C program that mines a huge data source (20GB of raw text) and generates loads of INSERTs to execute on a simple blank table (4 integer columns with 1 primary key). Set up as a MEMORY table, the entire task completes in 8 hours. After finishing, about 150 million rows exist in the table. Eight hours is a completely decent number for me. This is a one-time deal.
The problem comes when trying to convert the MEMORY table back into MyISAM so that (A) I'll have the memory freed up for other processes and (B) the data won't be killed when I restart the computer.
ALTER TABLE memtable ENGINE = MyISAM
I've let this ALTER TABLE query run for over two days now, and it's not done. I've now killed it.
If I create the table initially as MyISAM, the write speed seems terribly poor (especially because the query requires the ON DUPLICATE KEY UPDATE technique). I can't temporarily turn off the keys: the table would become over 1000 times larger if I did, and then I'd have to reprocess the keys and essentially run a GROUP BY on 150,000,000,000 rows. Umm, no.
One of the key constraints to realize: the INSERT query UPDATEs records if the primary key (a hash) already exists in the table.
At the very beginning of an attempt at strictly using MyISAM, I'm getting a rough speed of 1,250 rows per second. Once the index grows, I imagine this rate will tank even further.
I have 16GB of memory installed in the machine. What's the best way to generate a massive table that ultimately ends up as an on-disk, indexed MyISAM table?
Clarification: there are many, many UPDATEs coming from the query (INSERT ... ON DUPLICATE KEY UPDATE val=val+whatever). This isn't, by any means, a raw dump problem. My reasoning for trying a MEMORY table in the first place was to speed up all the index lookups and table changes that occur for every INSERT.
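In other words, each generated statement has roughly this shape (column names are invented; hash_key stands in for the integer primary key):

INSERT INTO memtable (hash_key, val) VALUES (123456789, 5)
ON DUPLICATE KEY UPDATE val = val + 5;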
If you intend to make it a MyISAM table, why are you creating it in memory in the first place? If it's only for speed, I think the conversion to a MyISAM table is going to negate any speed improvement you get by creating it in memory to start with.
You say inserting directly into an "on disk" table is too slow (though I'm not sure how you're deciding that when your current method takes days), but you may be able to turn off or remove the uniqueness constraints, use a DELETE query later to re-establish uniqueness, and then re-enable/re-add the constraints. I have used this technique when importing into an InnoDB table in the past, and found that even with the later delete it was much faster overall.
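A sketch of that idea with invented names; note it only applies where duplicate rows are redundant copies, and row_id is a hypothetical surrogate column used solely to pick which duplicate survives:

ALTER TABLE memtable DROP PRIMARY KEY;            -- remove the uniqueness constraint
-- ... bulk load everything, duplicates included ...
DELETE t1 FROM memtable t1
JOIN memtable t2 ON t1.hash_key = t2.hash_key AND t1.row_id < t2.row_id;
ALTER TABLE memtable ADD PRIMARY KEY (hash_key);  -- re-establish uniqueness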
Another option might be to create a CSV file instead of the INSERT statements, and either load it into the table using LOAD DATA INFILE (I believe that is faster than individual inserts, but I can't find a reference at present) or use it directly via the CSV storage engine, depending on your needs.
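A minimal LOAD DATA sketch (the file path and column list are assumptions):

LOAD DATA INFILE '/tmp/rows.csv'
INTO TABLE memtable
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
(hash_key, col2, col3, val);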
Sorry to keep throwing comments at you (last one, probably).
I just found this article, which provides an example of converting a large table from MyISAM to InnoDB. While this isn't exactly what you are doing, the author uses an intermediate MEMORY table and describes going from MEMORY to InnoDB in an efficient way: ordering the table in memory the way InnoDB expects it to be ordered in the end. If you aren't tied to MyISAM, it might be worth a look since you already have a "correct" MEMORY table built.
I don't use MySQL, but SQL Server, and this is the process I use to handle a file of similar size. First I dump the file into a staging table that has no constraints. Then I identify and delete the dupes from the staging table. Then I search for existing records that might match and put the id field into a column in the staging table. Then I update where the id column is not null, and insert where it is null. One of the reasons I do all the work of getting rid of the dupes in the staging table is that it means less impact on the production table when I run it, so it is faster in the end. My whole process runs in less than an hour (and actually does much more than I describe, as I also have to denormalize and clean the data) and affects production tables for less than 15 minutes of that time. I don't have to worry about adjusting any constraints or dropping indexes or any of that, since I do most of my processing before I hit the production table.
Consider whether a similar process might work better for you. Also, could you use some sort of bulk import to get the raw data into the staging table (I pull the 22GB file I have into staging in around 16 minutes) instead of working row by row?
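Translated loosely to MySQL (all names invented, joins used in place of the id-matching column described above, and duplicates collapsed by summing to match the val=val+... behaviour from the question), the core of that staging workflow might look like:

-- collapse duplicates inside the staging table first
CREATE TABLE staging_clean AS
SELECT hash_key, SUM(val) AS val FROM staging GROUP BY hash_key;
-- update rows that already exist in production
UPDATE prod p JOIN staging_clean s ON p.hash_key = s.hash_key
SET p.val = p.val + s.val;
-- insert rows that do not exist yet
INSERT INTO prod (hash_key, val)
SELECT s.hash_key, s.val
FROM staging_clean s
LEFT JOIN prod p ON p.hash_key = s.hash_key
WHERE p.hash_key IS NULL;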

Replication with lots of temporary table writes

I've got a database which I intend to replicate for backup reasons (performance is not a problem at the moment).
We've set up the replication correctly and tested it and all was fine.
Then we realized that it replicates all the writes to the temporary tables, which in effect meant that replicating one day's worth of data took almost two hours on an otherwise idle slave.
The reason for that is that we recompute some of the data in our db via cronjob every 15 mins to ensure it's in sync (it takes ~3 minutes in total, so it is unacceptable to do those operations during a web request; instead we just store the modifications without attempting to recompute anything while in the web request, and then do all of the work in bulk). In order to process that data efficiently, we use temporary tables (as there's lots of interdependencies).
Now, the first problem is that temporary tables do not persist if we restart the slave while it's in the middle of processing transactions that use that temp table. That can be avoided by not using temporary tables, although this has its own issues.
The more serious problem is that the slave could easily catch up in less than half an hour if it weren't for all that recomputation (which it does one run after the other, so there's no benefit to rebuilding the data every 15 minutes... and you can literally see it stuck at, say, 11:15, only to quickly catch up and get stuck at 11:30, etc.).
One solution we came up with is to move all that recomputation out of the replicated DB, so that the slave doesn't replicate it. But it has the disadvantage that we'd have to prune the tables it eventually updates, leaving our slave in effect "castrated", i.e. we'd have to recompute everything on it before we could actually use it.
Did anyone have a similar problem and/or how would you solve it? Am I missing something obvious?
I've come up with a solution. It makes use of the replicate-do-db option mentioned by Nick. I'm writing it down here in case somebody has a similar problem.
The problem with just using the replicate-(wild-)do* options in this case (as I said, we use temp tables to repopulate a central table) is that either you ignore the temp tables and repopulate the central one with no data (which causes further problems, as all the queries relying on the central table being up to date will produce different results), or you ignore the central table, which has a similar problem. Not to mention that you have to restart mysql after adding any of those options to my.cnf. We wanted something that would cover all those cases (and future ones) without the need for any further restarts.
So, what we decided to do is split the database into a "real" database and a "workarea" database. Only the "real" database is replicated (I guess you could instead decide on a naming convention for tables to be used with the replicate-wild-do-table syntax).
All the temporary-table work happens in the "workarea" DB, and to avoid the dependency problem mentioned above, we don't populate the central table (which sits in the "real" DB) with INSERT ... SELECT or RENAME TABLE; instead we query the tmp tables to generate a sort of diff against the live table (i.e. generate INSERT statements for new rows, DELETEs for old ones, and UPDATEs where necessary).
This way, the only queries that are replicated are exactly the updates that are required, nothing else; some (most?) of the recomputation queries happening every fifteen minutes might not even make their way to the slave, and the ones that do are minimal and not computationally expensive at all, just simple INSERTs and DELETEs.
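A sketch of what such a diff can look like (database, table, and column names are all invented; id is assumed to be the primary key in both the recomputed workarea table and the central table):

-- new rows: present in the recomputed data but not yet in the central table
INSERT INTO real_db.central (id, value)
SELECT t.id, t.value
FROM workarea.central_tmp t
LEFT JOIN real_db.central c ON c.id = t.id
WHERE c.id IS NULL;
-- removed rows: no longer present in the recomputed data
DELETE c FROM real_db.central c
LEFT JOIN workarea.central_tmp t ON t.id = c.id
WHERE t.id IS NULL;
-- changed rows
UPDATE real_db.central c
JOIN workarea.central_tmp t ON t.id = c.id
SET c.value = t.value
WHERE c.value <> t.value;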
In MySQL, as of 5.0 I believe, you can use table wildcards to replicate specific tables. There are a number of command-line options that can be set, but you can also do this via your MySQL config file.
[mysqld]
replicate-do-db = db1
replicate-do-table = db2.mytbl2
replicate-wild-do-table= database_name.%
replicate-wild-do-table= another_db.%
The idea is that you tell it not to replicate any tables other than the ones you specify.