What exactly happened with the indexing operation? - mysql

I was indexing a huge table today containing 2 billion records. I thought MySQL would eat away my 2TB drive... The disk consumption kept increasing to 400GB, then 500GB, and then finally dropped to 180GB, and MySQL said the index was added successfully. Why the increase in space, and what happened in the end? Can someone please give me some pointers?

Incidentally, yesterday I answered a question on how to make index creation faster in MySQL, and the following came out of my research:
The CREATE INDEX and DROP INDEX commands work by creating a new, empty table defined with the requested set of indexes. It then copies the existing rows to the new table one-by-one, updating the indexes as it goes. Inserting entries into the indexes in this fashion, where the key values are not sorted, requires random access to the index nodes, and is far from optimal. After all rows from the original table are copied, the old table is dropped and the copy is renamed with the name of the original table.
Source: Overview of Fast Index Creation
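To make the disk-space behaviour concrete, here is a hedged sketch of what that table-copy method amounts to (the table, column, and index names are hypothetical). While the copy runs, both the original table and the new indexed copy exist on disk at the same time, which would explain why usage climbs during the operation and drops back once the old table is dropped:

CREATE TABLE big_copy LIKE big;                   -- new, empty table
ALTER TABLE big_copy ADD INDEX idx_col (col);     -- defined with the requested index
INSERT INTO big_copy SELECT * FROM big;           -- rows copied one by one, indexes updated as it goes
RENAME TABLE big TO big_old, big_copy TO big;     -- swap names
DROP TABLE big_old;                               -- only now is the old table's space released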

Related

Post optimization needed after deleting rows in a MYSQL Database

I have a log table that is currently 10GB. It has a lot of data for the past 2 years, and I really feel at this point I don't need so much in there. Am I wrong to assume it is not good to have years of data in a table (a smaller table is better)?
My tables all use the MyISAM engine.
I would like to delete all data from 2014 and 2015, and soon I'll do 2016, but I'm concerned about what exactly will happen after I run the DELETE statement. I understand that because it's MyISAM there is a lock that will occur where no writing can take place? I would probably delete data by the month, and do it late at night, to minimize this, as it's a production DB.
My prime interest, specifically, is this: should I take some sort of action after this deletion? Do I need to manually tell MySQL to do anything to my table, or is MySQL going to do all the housekeeping itself, reclaiming everything, reindexing, and ultimately optimizing my table after the 400,000k records I'll be deleting?
Thanks everyone!
Plan A: Use a time-series PARTITIONing of the table so that future deletions are 'instantaneous' because of DROP PARTITION (a sketch follows after the note below). More discussion here. Partitioning only works if you will be deleting all rows older than X.
Plan B: To avoid lengthy locking, chunk the deletes. See here. This is optionally followed by an OPTIMIZE TABLE to reclaim space.
Plan C: Simply copy over what you want to keep, then abandon the rest. This is especially good if you need to preserve only a small proportion of the table.
CREATE TABLE `new` LIKE `real`;
INSERT INTO `new`
SELECT * FROM `real`
WHERE ... ; -- just the newer rows
RENAME TABLE `real` TO old, `new` TO `real`; -- instantaneous and atomic
DROP TABLE old; -- after verifying that all went well.
Note: The .MYD file contains the data; it will never shrink. Deletes will leave holes in it. Further inserts (and updates) will use the holes in preference to growing the table. Plans A and C (but not B) will avoid the holes, and truly free up space.
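As a hedged sketch of Plan A (all table, column, and partition names here are hypothetical, and the partitioning column must be part of every unique key on the table), yearly RANGE partitions and an 'instantaneous' purge might look like this:

ALTER TABLE log_table
    PARTITION BY RANGE (TO_DAYS(created_at)) (
        PARTITION p2014 VALUES LESS THAN (TO_DAYS('2015-01-01')),
        PARTITION p2015 VALUES LESS THAN (TO_DAYS('2016-01-01')),
        PARTITION p2016 VALUES LESS THAN (TO_DAYS('2017-01-01')),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

ALTER TABLE log_table DROP PARTITION p2014;   -- drops all of 2014 without scanning rows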
Tim and e4c5 have given some good recommendations and I urge them to add their answers.
You can run OPTIMIZE TABLE after doing the deletes. Optimize table will help you with a few things (taken from the docs):
If the table has deleted or split rows, repair the table.
If the index pages are not sorted, sort them.
If the table's statistics are not up to date (and the repair could not be accomplished by sorting the index), update them.
According to the docs: http://dev.mysql.com/doc/refman/5.7/en/optimize-table.html
Use OPTIMIZE TABLE in these cases, depending on the type of table:
...
After deleting a large part of a MyISAM or ARCHIVE table, or making many changes to a MyISAM or ARCHIVE table with variable-length rows (tables that have VARCHAR, VARBINARY, BLOB, or TEXT columns). Deleted rows are maintained in a linked list and subsequent INSERT operations reuse old row positions. You can use OPTIMIZE TABLE to reclaim the unused space and to defragment the data file. After extensive changes to a table, this statement may also improve performance of statements that use the table, sometimes significantly.
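For instance, a hedged example of deleting one month of old rows and then reclaiming the space (table and column names are hypothetical) could look like this:

DELETE FROM log_table
WHERE  created_at >= '2014-01-01'
  AND  created_at <  '2014-02-01';

OPTIMIZE TABLE log_table;    -- defragments the .MYD data file and rebuilds/sorts the indexes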

How to reduce the index size of a table in MySQL using the InnoDB engine?

I am facing a performance issue in MySQL due to a large index size on my table. The index size has grown to 6GB, and my instance is running on 32GB of memory. The majority of rows in that table are not required after a few hours and can be removed selectively, but removing them is time-consuming and doesn't reduce the index size.
Please suggest a solution for managing this index.
You can optimize your table to rebuild its indexes and get the space back if it is not freed even after deletion:
OPTIMIZE TABLE table_name;
But as your table is bulky, it will be locked during OPTIMIZE TABLE, and you also face the problem of how to remove old data when you only need the last few hours of it. So you can do the following:
Step 1: During night hours, or when there is less traffic on your DB, first rename your main table and create a new table with the same name. Then insert the last few hours of data from the old table into the new table.
This way you remove the unwanted data, and the new table is effectively optimized as well.
Step 2: To avoid this issue in the future, create a stored procedure that executes once per day during night hours and either deletes data up to the previous day (as per your requirement) or moves it to a history table.
Step 3: Since your table now always keeps only a single day of data, you can run OPTIMIZE TABLE to rebuild it and reclaim space easily.
Note: A DELETE statement will not rebuild the indexes and will not free space on the server. For that you need to optimize the table, which can be done in various ways, e.g. with an ALTER TABLE statement or with OPTIMIZE TABLE.
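A hedged sketch of Step 1 (table and column names are hypothetical, and this assumes a low-traffic window, since writes arriving between the RENAME and the CREATE would fail):

RENAME TABLE events TO events_old;
CREATE TABLE events LIKE events_old;               -- same columns and indexes, but empty
INSERT INTO events
    SELECT * FROM events_old
    WHERE  created_at >= NOW() - INTERVAL 6 HOUR;  -- keep only the rows still needed
-- DROP TABLE events_old;                          -- once you have verified the new table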
If you can remove all the rows older than X hours, then PARTITIONing is the way to go. PARTITION BY RANGE on the hour and use DROP PARTITION to remove an old hour and REORGANIZE PARTITION to create a new hour. You should have X+2 partitions. More details.
If the deletes are more complex, please provide more details; perhaps we can come up with another solution that deals with the question about index size. Please include SHOW CREATE TABLE.
Even if you cannot use partitions for purging, it may be useful to have partitions for OPTIMIZE. Do not use OPTIMIZE PARTITION; it optimizes the entire table. Instead, use REORGANIZE PARTITION if you see you need to shrink the index.
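A hedged sketch of that hourly layout (table, column, and partition names and the boundary values are hypothetical; the partitioning column must be part of every unique key, including the primary key):

ALTER TABLE events
    PARTITION BY RANGE (TO_DAYS(ts) * 24 + HOUR(ts)) (
        PARTITION h00    VALUES LESS THAN (TO_DAYS('2016-05-01') * 24 + 1),
        PARTITION h01    VALUES LESS THAN (TO_DAYS('2016-05-01') * 24 + 2),
        PARTITION future VALUES LESS THAN MAXVALUE
    );

-- Each hour: drop the oldest partition and carve the next hour out of the catch-all.
ALTER TABLE events DROP PARTITION h00;
ALTER TABLE events REORGANIZE PARTITION future INTO (
    PARTITION h02    VALUES LESS THAN (TO_DAYS('2016-05-01') * 24 + 3),
    PARTITION future VALUES LESS THAN MAXVALUE
);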
How big is the table?
How big is innodb_buffer_pool_size?
(6GB index does not seem that bad, especially since you have 32GB of RAM.)

Adding Index to 3 million rows MySQL

I need to add at least 1 index to a column of type int(1) on an InnoDB table. There are about 3 million rows that it would need to index. This is a database on my production server, and it is in use by thousands of people every day. I tried to add an index the standard way, but it was taking too much time (I let it run for about 7 minutes before killing the process) and locking rows, meaning a frozen application for many users.
My VPS that runs all of this has 512mb of RAM and has an Intel Xeon E5504 processor.
How can I add an index to this production database without interrupting my user's experience?
Unless the table is either read-only or write-only (one or the other, not both), you'll probably need to take down the site: lock the database, run the operation, and wait.
If the table is write-only, swap the writes over to a temporary table, run the operation on the old table, then swap the writes back to the old table and insert the data from the temporary table.
If the table is read-only, duplicate the table and run the operation on the copy.
If the table is read/write, then a messy alternative that might work is to create a new table with the indexes, set the primary key's starting point to the next value in the original table, add a join to your read requests to select from both tables, but write exclusively to the new table (a rough sketch follows below). Then write a script that copies rows from the old table to the new one and deletes them from the old table. It'll take far, far longer than the downtime would, and plenty can go wrong, but it should be doable.
you can set the start point of a primary key with
ALTER TABLE `my_table` AUTO_INCREMENT = X;
hope that helps.
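A hedged sketch of that dual-table setup (table, column, and index names are hypothetical; a UNION is used here as one simple way to read from both tables, and X is the placeholder from the statement above):

CREATE TABLE my_table_new LIKE my_table;
ALTER TABLE my_table_new ADD INDEX idx_flag (flag);   -- the new index, built while the table is empty
ALTER TABLE my_table_new AUTO_INCREMENT = X;          -- replace X with the next id in the original table

-- Reads temporarily have to consult both tables:
SELECT * FROM my_table     WHERE id = 42
UNION ALL
SELECT * FROM my_table_new WHERE id = 42;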
Take a look at pt-online-schema-change. I think this tool can be quite useful in your case. It will obviously put additional load on your database server but should not block access to the table for most of the operation time.

Can a killed ALTER TABLE statement leave an index partially built?

We started an ALTER TABLE which dropped one index and added another. Although we were only expecting the table to be locked for writes, reads started queueing up, so we killed the ALTER process. But when the KILL finished, the old index was gone, and the new index was there in its place, with a much lower cardinality than expected.
Searching on the table seems to be faster now, so it seems like the ALTER went through fine, but we're not sure. Is it possible that our KILL has left the index in a partially-built stage?
If the index is there you may assume that it is complete.
You can use SHOW CREATE TABLE or SHOW INDEXES to see the indexes on the table.
As noted in the comments, the cardinality listed by SHOW INDEXES is just an estimate.
One test you can try is to run SHOW INDEXES, then run ANALYZE TABLE, then run SHOW INDEXES again and see how the estimated cardinality value changes.
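For example (the table name is hypothetical):

SHOW INDEXES FROM my_table;   -- note the Cardinality column
ANALYZE TABLE my_table;       -- refresh the index statistics
SHOW INDEXES FROM my_table;   -- see how the estimate changes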
By your description (both reads/writes locked) you are most likely using an older version of InnoDB, or adding an index to a column in utf8 character set.
Here is how it works in your version:
1. An empty table is created with the new table definition.
2. Rows are copied one after the other from the old table to the new table (the new indexes are built as the rows are inserted).
3. Once the copy is complete, the old table is deleted and the new table is renamed.
(If you cancel between steps 2 and 3, the new table is just safely removed.)
For full disclosure - here is how it works in InnoDB plugin (default for MySQL 5.5, available from 5.1+):
1. The table is read through to find the data for the index, which is written to a temporary file.
2. The temporary file is sorted.
3. The index is created by inserting the data in sorted order.
(This method is more efficient. InnoDB calls it "Fast Index Creation".)

Generating a massive 150M-row MySQL table

I have a C program that mines a huge data source (20GB of raw text) and generates loads of INSERTs to execute on a simple blank table (4 integer columns with 1 primary key). Set up as a MEMORY table, the entire task completes in 8 hours. After finishing, about 150 million rows exist in the table. Eight hours is a completely decent number for me. This is a one-time deal.
The problem comes when trying to convert the MEMORY table back into MyISAM so that (A) I'll have the memory freed up for other processes and (B) the data won't be killed when I restart the computer.
ALTER TABLE memtable ENGINE = MyISAM
I've let this ALTER TABLE query run for over two days now, and it's not done. I've now killed it.
If I create the table initially as MyISAM, the write speed seems terribly poor (especially because the query requires the ON DUPLICATE KEY UPDATE technique). I can't temporarily turn off the keys. The table would become over 1000 times larger if I were to, and then I'd have to reprocess the keys and essentially run a GROUP BY on 150,000,000,000 rows. Umm, no.
One of the key constraints to realize: The INSERT query UPDATEs records if the primary key (a hash) exists in the table already.
At the very beginning of an attempt at strictly using MyISAM, I'm getting a rough speed of 1,250 rows per second. Once the index grows, I imagine this rate will tank even more.
I have 16GB of memory installed in the machine. What's the best way to generate a massive table that ultimately ends up as an on-disk, indexed MyISAM table?
Clarification: There are many, many UPDATEs going on from the query (INSERT ... ON DUPLICATE KEY UPDATE val=val+whatever). This isn't, by any means, a raw dump problem. My reasoning for trying a MEMORY table in the first place was to speed up all the index lookups and table changes that occur for every INSERT.
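A hedged reconstruction of that kind of statement (the column names are hypothetical, assuming the 4-integer-column table described above with the hash as the primary key):

INSERT INTO counts (hash, a, b, val)
VALUES (123456789, 1, 2, 5)
ON DUPLICATE KEY UPDATE val = val + VALUES(val);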
If you intend to make it a MyISAM table, why are you creating it in memory in the first place? If it's only for speed, I think the conversion to a MyISAM table is going to negate any speed improvement you get by creating it in memory to start with.
You say inserting directly into an "on disk" table is too slow (though I'm not sure how you're deciding it is when your current method is taking days). You may be able to turn off or remove the uniqueness constraints and then use a DELETE query later to re-establish uniqueness, then re-enable/add the constraints. I have used this technique when importing into an INNODB table in the past, and found even with the later delete it was overall much faster.
Another option might be to create a CSV file instead of the INSERT statements, and either load it into the table using LOAD DATA INFILE (I believe that is faster than the inserts, but I can't find a reference at present) or use it directly via the CSV storage engine, depending on your needs.
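A hedged sketch of the LOAD DATA route (file path, table, and column names are hypothetical). Note that LOAD DATA INFILE only offers REPLACE or IGNORE for duplicate keys, so the additive ON DUPLICATE KEY UPDATE logic would have to be handled separately, for example via a staging table:

LOAD DATA INFILE '/tmp/rows.csv'
INTO TABLE final_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(hash, a, b, val);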
Sorry to keep throwing comments at you (last one, probably).
I just found this article which provides an example of converting a large table from MyISAM to InnoDB. While this isn't exactly what you are doing, he uses an intermediate MEMORY table and describes going from memory to InnoDB in an efficient way: ordering the table in memory the way that InnoDB expects it to be ordered in the end. If you aren't tied to MyISAM it might be worth a look, since you already have a "correct" memory table built.
I don't use MySQL, but I use SQL Server, and this is the process I use to handle a file of similar size. First I dump the file into a staging table that has no constraints. Then I identify and delete the dups from the staging table. Then I search for existing records that might match and put the id field into a column in the staging table. Then I update where the id field column is not null and insert where it is null. One of the reasons I do all the work of getting rid of the dups in the staging table is that it means less impact on the prod table when I run it, and thus it is faster in the end. My whole process runs in less than an hour (and actually does much more than I describe, as I also have to denormalize and clean the data) and affects production tables for less than 15 minutes of that time. I don't have to worry about adjusting any constraints or dropping indexes or any of that, since I do most of my processing before I hit the prod table.
Consider whether a similar process might work better for you. Also, could you use some sort of bulk import to get the raw data into the staging table (I pull the 22 gig file I have into staging in around 16 minutes) instead of working row-by-row?
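A hedged translation of that staging-table idea into MySQL (all names are hypothetical, and the aggregation assumes the merge rule is simply summing val; adjust it to whatever your real merge logic is):

CREATE TABLE staging LIKE final_table;
ALTER TABLE staging DROP PRIMARY KEY;           -- no constraints while bulk-loading

-- bulk-load the raw rows into staging (e.g. with LOAD DATA INFILE as above), then:
INSERT INTO final_table (hash, a, b, val)
    SELECT hash, MAX(a), MAX(b), SUM(val)       -- collapse duplicates inside staging first
    FROM staging
    GROUP BY hash
ON DUPLICATE KEY UPDATE val = val + VALUES(val);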