MySQL: add a field to a large table

I have a table with about 200,000 records and I want to add a field to it:
ALTER TABLE `table` ADD `param_21` BOOL NOT NULL COMMENT 'about the field' AFTER `param_20`
But it seems to be a very heavy query and it takes a very long time, even on my quad-core AMD PC with 4GB of RAM.
I am running under Windows/XAMPP and phpMyAdmin.
Does MySQL have to touch every record when adding a field?
Or can I change the query so it makes the change more quickly?

MySQL will, in almost all cases, rebuild the table during an ALTER**. This is because the row-based engines (i.e. all of them) HAVE to do this to keep the data in the right format for querying. It's also because there are many other changes you could make which would also require rebuilding the table (such as changing indexes, primary keys etc.).
I don't know what engine you're using, but I will assume MyISAM. MyISAM copies the data file, making any necessary format changes - this is relatively quick and is not likely to take much longer than the IO hardware needs to get the old data file in and the new one out to disc.
Rebuilding the indexes is the real killer. Depending on how you have it configured, MySQL will either, for each index, put the indexed columns into a filesort buffer (which may be in memory but is typically on disc), sort it with its filesort() function (which does a quicksort by recursively copying the data between two files if it's too big for memory), and then build the entire index from the sorted data.
If it can't do the filesort trick, it will instead behave as if you did an INSERT for every row, populating the index blocks with each row's data in turn. This is painfully slow and results in far from optimal indexes.
You can tell which it's doing by using SHOW PROCESSLIST during the process. "Repair by sorting" is good; "Repair with keycache" is bad.
All of this will use AT MOST one core, but will sometimes be IO bound as well (especially copying the data file).
** There are some exceptions, such as dropping secondary indexes on innodb plugin tables.
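For reference, here is a minimal way to watch which path the server is taking while the ALTER runs (run from a second connection; the buffer sizes at the end are purely illustrative):
SHOW PROCESSLIST;
-- or narrow it down to the ALTER itself:
SELECT id, state, time, info
FROM information_schema.PROCESSLIST
WHERE info LIKE 'ALTER TABLE%';
-- If you see the keycache path, raising the MyISAM repair buffers beforehand
-- may let the sort-based path be used instead:
SET GLOBAL myisam_sort_buffer_size = 256 * 1024 * 1024;
SET GLOBAL myisam_max_sort_file_size = 50 * 1024 * 1024 * 1024;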

If you add a NOT NULL column, every tuple needs to be populated with a value. So it will be slow...

This touches each of the 200,000 records, as each record needs to be updated with a new bool value which is not going to be null.
So yes, it's an expensive query... There is nothing you can do to make it faster.

Related

mysql partitioning by key internal hashing function

We have a table partitioned by key (BINARY(16)).
Is there any way to calculate which partition a record will go to, outside of MySQL?
What is the hash function (the non-linear one)?
The reason is to sort the CSV files outside MySQL, insert them in parallel into the right partitions with LOAD DATA INFILE, and then index in parallel too.
I can't find the function in the MySQL docs.
What's wrong with LINEAR? Are you trying to LOAD in parallel?
How many indexes do you have? If it's only that hash, sort the data, then load it into a non-partitioned InnoDB table with the PK already in place. Meanwhile, make sure every column uses the smallest possible datatype. How much RAM do you have?
If you are using MyISAM, consider MERGE - with that, you can load each partition-like table in a separate thread. When finished, construct the "merge" table that combines them.
What types of queries will you be using? Single-row lookups by the BINARY(16)? Anything else might have big performance issues.
How much RAM? We need to tune either key_buffer_size or innodb_buffer_pool_size.
Be aware of the limitations. MyISAM defaults to a 7-byte data pointer and a 6-byte index pointer. 15TB would need only a 6-byte data pointer if the rows are DYNAMIC (byte pointer), or 5 bytes if they are FIXED (row number). So that could be 1 or 2 bytes to be saved. If anything is variable length, go with DYNAMIC; it would waste too much space (and probably not improve speed) to go FIXED. I don't know if the index pointer can be shrunk in your case.
Are you on 5.7? MySQL 8.0 removes support for partitioned MyISAM tables (only InnoDB and NDB can be partitioned there); meanwhile, MariaDB still handles MyISAM.
Will you first split the data by "partition", or send off INSERTs to different "partitions" one by one? (This choice adds some more wrinkles, and possibly optimizations.)
Maybe...
Sort the incoming data by the binary version of MD5().
Split into chunks based on the first 4 bits (or do the split without sorting first). Be sure to run LOAD DATA for one 4-bit value in only one thread.
Have PARTITION BY RANGE with 16 partitions (see the DDL sketch after the list below):
VALUES LESS THAN 0x1000000000000000
VALUES LESS THAN 0x2000000000000000
...
VALUES LESS THAN 0xF000000000000000
VALUES LESS THAN MAXVALUE
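A minimal DDL sketch of that layout, assuming the key column is called id; RANGE COLUMNS is used so the BINARY(16) value can be compared directly, and the boundaries are written out as full 16-byte hex literals:
CREATE TABLE t (
  id BINARY(16) NOT NULL,
  -- other columns go here
  PRIMARY KEY (id)
) ENGINE=InnoDB
PARTITION BY RANGE COLUMNS(id) (
  PARTITION p00 VALUES LESS THAN (0x10000000000000000000000000000000),
  PARTITION p01 VALUES LESS THAN (0x20000000000000000000000000000000),
  PARTITION p02 VALUES LESS THAN (0x30000000000000000000000000000000),
  PARTITION p03 VALUES LESS THAN (0x40000000000000000000000000000000),
  PARTITION p04 VALUES LESS THAN (0x50000000000000000000000000000000),
  PARTITION p05 VALUES LESS THAN (0x60000000000000000000000000000000),
  PARTITION p06 VALUES LESS THAN (0x70000000000000000000000000000000),
  PARTITION p07 VALUES LESS THAN (0x80000000000000000000000000000000),
  PARTITION p08 VALUES LESS THAN (0x90000000000000000000000000000000),
  PARTITION p09 VALUES LESS THAN (0xA0000000000000000000000000000000),
  PARTITION p10 VALUES LESS THAN (0xB0000000000000000000000000000000),
  PARTITION p11 VALUES LESS THAN (0xC0000000000000000000000000000000),
  PARTITION p12 VALUES LESS THAN (0xD0000000000000000000000000000000),
  PARTITION p13 VALUES LESS THAN (0xE0000000000000000000000000000000),
  PARTITION p14 VALUES LESS THAN (0xF0000000000000000000000000000000),
  PARTITION p15 VALUES LESS THAN (MAXVALUE)
);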
I don't know of a limit on the number of rows in a LOAD DATA, but I would worry about ACID locks having problems if you go over, say, 10K rows at a time.
This technique may even work for a non-partitioned table.
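And if each leading-nibble chunk lands in its own file, every loading thread can target its partition explicitly; the PARTITION clause on LOAD DATA (MySQL 5.6+) will reject any row that belongs elsewhere, which catches splitting mistakes early. The file path, column list and the UNHEX() step are assumptions about the CSV layout:
LOAD DATA INFILE '/data/chunk_3.csv'
INTO TABLE t PARTITION (p03)
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
(@hex_id, other_col)
SET id = UNHEX(@hex_id);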

Mariadb table defragmentation using OPTIMIZE

We are running MariaDB 10.1.30 and are testing a database maintenance script that defragments tables and rebuilds indexes using the OPTIMIZE TABLE command, making use of the 10.1.1 feature of setting innodb_defragment = 1.
I've tested ALTER TABLE with ALGORITHM=INPLACE and it works fine, but I'm trying to make use of innodb_defragment and use OPTIMIZE to avoid creating the temp files that ALTER TABLE with the INPLACE algorithm produces when tables are rebuilt.
When using OPTIMIZE, no temp tables are created; however, the table gets locked, not allowing concurrent connections, which is not the case with ALTER TABLE with ALGORITHM=INPLACE. The documentation, however, says that the OPTIMIZE is done using the INPLACE algorithm.
https://mariadb.org/defragmenting-unused-space-on-innodb-tablespace/
Is this a bug, or am I missing something here? Please advise.
The benefit for speed is virtually nil.
A "point query" (where you have the key and can go directly to the row) depends on the depth of the BTree. For a million rows, the depth will be about 3. For a trillion rows, about 6. Optimizing a table is very unlikely to shrink the depth.
A "range scan" (BETWEEN, >, etc) walks across a block, looking at each row. Then it hops (via a link) to the next block until it has found all the rows needed. Sure, you will touch more blocks in an un-optimized table, but the bulk of the effort is in accessing each row.
The benefit for space is limited.
An INSERT may add to a non-full block or it may split a full block into two half-full blocks. Later, two adjacent, somewhat empty, blocks will be merged together. Hence, a BTree naturally gravitates toward a state where the average block is 69% full. That is, the benefit of OPTIMIZE TABLE for space is limited.
Phrased differently, OPTIMIZE might shrink the disk footprint for a table to only 69% of what it was, but subsequent operations will just grow the table again.
If you are using innodb_file_per_table=OFF, then OPTIMIZE cannot return the free blocks to the Operating system. Such blocks can be reused for future INSERTs.
OPTIMIZE TABLE is invasive.
It copies the table over, locking it during the process. This is unacceptable to sites that need 100% uptime.
If you are using replication, subsequent writes may stack up behind the OPTIMIZE, thereby making the Slave not up-to-the-second.
Big DELETEs
After deleting lots of rows, there may be benefit to OPTIMIZE, but check the 69% estimate.
If big deletes are a common occurrence, perhaps there are other things that you should be doing. See http://mysql.rjweb.org/doc.php/deletebig
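Before running OPTIMIZE after a big delete, it can be worth checking how much reclaimable space the table actually reports; a rough query for that (the schema name is a placeholder, and data_free is only an estimate):
SELECT table_name,
       ROUND(data_length / 1048576)  AS data_mb,
       ROUND(index_length / 1048576) AS index_mb,
       ROUND(data_free / 1048576)    AS free_mb
FROM information_schema.TABLES
WHERE table_schema = 'your_db'
ORDER BY data_free DESC;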
History and internals
Old versions did OPTIMIZE in a straightforward way: create a new table (same schema); copy the rows into it; rename the table; drop the old one. Writes could not be allowed during the process.
ALGORITHM=INPLACE probably locks a few blocks, combines them to fill up one block, then slides forward. This requires some degree of locking. Based on the Question, it sounds like it simply locks the whole table.
Note that each BTree (the PK+data, or a secondary index) could be 'optimized' independently. But no command allows doing that for just the main BTree (PK+data). Optimizing a single secondary index can be done with DROP INDEX + ADD INDEX, but that leaves you without the index while it is rebuilt. Instead, consider doing a NOCOPY ADD INDEX, then an INSTANT DROP INDEX. Caution: this could impact USE INDEX or FORCE INDEX hints if you are using them.
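A sketch of that last trick (table and index names are invented; the NOCOPY/INSTANT keywords are MariaDB 10.3+ syntax, so check what your server accepts):
-- Build a fresh copy of the secondary index alongside the fragmented one ...
ALTER TABLE t ADD INDEX idx_col_new (col), ALGORITHM=NOCOPY;
-- ... then drop the old one without copying the table.
ALTER TABLE t DROP INDEX idx_col, ALGORITHM=INSTANT;
-- Afterwards, adjust any USE INDEX / FORCE INDEX hints that named idx_col.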
(Caveat: This Answer applies to InnoDB, not MyISAM.)

Creating indexes on large tables in MySQL (MariaDB) takes a verrry looong time

I have a table with a few billion rows of data and I am trying to build 5 indexes on it at once. The table format is MyISAM to save space. Once I build the indexes this will be a static table; I just need it to be read-only.
I created the indexes using this command:
alter table links8 add index(uid,tid), add index (date), add index (tid), add index (userid), add index (updated,uid,tid,userid,date);
The command has been running for over 45 days. You read that right: 45 DAYS. I can see that the temp files are still being accessed, so it isn't a dead query.
My question is: WTF? It seems like it should take a few hours at most to sort and build the indexes, even with a few billion rows.
Since I have a static table, is there another storage engine that makes sense to use? InnoDB takes up way too much space.
45 days doesn't seem right, because in that time MySQL is bound to be doing something, and that something is likely consuming RAM or storage, probably both, which means that you should have run out of one of them at some point.
I'd assume it's RAM, because that usually is where things get sparse ;)
Now, you're absolutely right: sorting a few billion values in memory shouldn't take ages. But sorting a few billion values that are the concatenated values of (updated,uid,tid,userid,date) most likely doesn't happen in RAM. Assuming updated and date are of type DATETIME, they take 8 bytes each; uid, tid and userid would normally be 32-bit ints, but since your table has more than 2^32 entries (I'm assuming that), unique IDs would be 8 bytes long, too. So one value of type (updated,uid,tid,userid,date) would be 40 bytes long.
Now throw in, let's say, 5 billion of these; you get 200 GB of pure row data that you'll need to sort to build an index. Assuming you're not doing this on some huge machine, you obviously need to swap parts of these values out to disk -- since you see temporary files appear, my wild guess is that this is what's happening, and MySQL is actively doing it itself. Sorting algorithms that work on parts of the data iteratively are much slower, because first you sort all the parts, then you merge the parts in a manner that's better sorted than before, then you re-partition your data and sort the parts again ... with storing to and loading from disk in between.
By the way, a memory operation that lasts 45 days is likely to be prone to memory bit errors if no corrective measures are taken (basically, use ECC RAM for this kind of task, or you end up with indexed, corrupted data).
The MySQL manual itself suggests building a special MD5 index that takes the hash of your search tuple and looks that up, since sorting 128-bit (16-byte) MD5 hashes is easier than sorting 40-byte (320-bit) composite rows.
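A minimal sketch of that hash-column idea, using the column names from the question (the '#' separator, the BINARY(16) storage and the placeholder @-variables are my choices, not something MySQL prescribes):
-- Store the MD5 of the composite tuple and index only that.
ALTER TABLE links8 ADD COLUMN tuple_md5 BINARY(16);
UPDATE links8
SET tuple_md5 = UNHEX(MD5(CONCAT_WS('#', updated, uid, tid, userid, date)));
ALTER TABLE links8 ADD INDEX (tuple_md5);
-- Lookups recompute the hash, then re-check the real columns because
-- different tuples can (rarely) collide on the same MD5 value.
SELECT *
FROM links8
WHERE tuple_md5 = UNHEX(MD5(CONCAT_WS('#', @updated, @uid, @tid, @userid, @date)))
  AND updated = @updated AND uid = @uid AND tid = @tid
  AND userid = @userid AND date = @date;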
I found a better solution.
I created a new table with the indexes already in place, then issued an INSERT ... SELECT from one table to the other. The way this works is that it fills up the MYD (raw data) file first and only then creates the indexes. Once it had started creating the indexes, I killed the query. Then, on the filesystem, I used myisamchk to repair the table manually.
That command looked like this:
myisamchk --force --fast --update-state --key_buffer_size=2000M --sort_buffer_size=2000M --read_buffer_size=10M --write_buffer_size=10M TABLE.MYI
And the whole thing took less than 12 hours and the data looks good!
UPDATE:
Here is the flow summarized.
create table2 identical to table1, with the indexes;
insert into table2 select * from table1;
once the MYD file is full and it starts on the MYI file, kill the query
then shut down mysql, run the myisamchk command, and restart mysql
OR
copy table2.MYD and table2.MYI to table3.MYD and table3.MYI, then run myisamchk, then copy table2.frm to table3.frm and change the permissions; when it's all done you should be able to access table3 without restarting mysql

re-indexing in mysql

I have a table in MySQL which already has an index. I added some rows to the table; do I need to re-index the table somehow, or does MySQL do this for me automatically?
This is done automatically. It is also the reason why we sometimes don't want to create indexes: keeping the indexes up to date on every insert has a small but non-zero performance overhead.
If you define an index in MySQL then it will always reflect the current state of the database unless you have deliberately disabled indexing. As soon as indexing is re-enabled, the index will be brought up to date. Usually indexing is only disabled during large insertions for performance reasons.
There is a cost associated with each index on your table. While a good index can speed up retrieval times immensely, every index you define slows insertion by a small amount. The insertion costs grow slowly with the size of the database. This is why you should only define indexes you absolutely need if you are going to be working on large sets of data.
If you want to see what indexes are defined, you can use SHOW CREATE TABLE to have a look at a particular table.
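For example (the table name is a placeholder):
-- See which indexes currently exist:
SHOW CREATE TABLE mytable;
SHOW INDEX FROM mytable;
-- MyISAM only: the "deliberately disabled indexing" case mentioned above.
-- Non-unique indexes are ignored during the bulk load and rebuilt in one pass afterwards.
ALTER TABLE mytable DISABLE KEYS;
-- ... large INSERTs / LOAD DATA here ...
ALTER TABLE mytable ENABLE KEYS;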
No, you don't need to rebuild the index.
Each record insertion automatically updates the existing indexes.

Generating a massive 150M-row MySQL table

I have a C program that mines a huge data source (20GB of raw text) and generates loads of INSERTs to execute on a simple blank table (4 integer columns with 1 primary key). Set up as a MEMORY table, the entire task completes in 8 hours. After finishing, about 150 million rows exist in the table. Eight hours is a completely decent number for me. This is a one-time deal.
The problem comes when trying to convert the MEMORY table back into MyISAM so that (A) I'll have the memory freed up for other processes and (B) the data won't be killed when I restart the computer.
ALTER TABLE memtable ENGINE = MyISAM
I've let this ALTER TABLE query run for over two days now, and it's not done. I've now killed it.
If I create the table initially as MyISAM, the write speed seems terribly poor (especially because the query relies on the ON DUPLICATE KEY UPDATE technique). I can't temporarily turn off the keys: the table would become over 1000 times larger if I did, and then I'd have to reprocess the keys and essentially run a GROUP BY on 150,000,000,000 rows. Umm, no.
One of the key constraints to realize: The INSERT query UPDATEs records if the primary key (a hash) exists in the table already.
At the very beginning of an attempt at strictly using MyISAM, I'm getting a rough speed of 1,250 rows per second. Once the index grows, I imagine this rate will tank even more.
I have 16GB of memory installed in the machine. What's the best way to generate a massive table that ultimately ends up as an on-disk, indexed MyISAM table?
Clarification: There are many, many UPDATEs going on from the query (INSERT ... ON DUPLICATE KEY UPDATE val=val+whatever). This isn't, by any means, a raw dump problem. My reasoning for trying a MEMORY table in the first place was to speed up all the index lookups and table changes that occur for every INSERT.
If you intend to make it a MyISAM table, why are you creating it in memory in the first place? If it's only for speed, I think the conversion to a MyISAM table is going to negate any speed improvement you get by creating it in memory to start with.
You say inserting directly into an "on disk" table is too slow (though I'm not sure how you're deciding that, given that your current method takes days), but you may be able to turn off or remove the uniqueness constraints, use a DELETE query later to re-establish uniqueness, and then re-enable/add the constraints. I have used this technique when importing into an InnoDB table in the past, and found that even with the later delete it was much faster overall.
Another option might be to create a CSV file instead of the INSERT statements, and either load it into the table using LOAD DATA INFILE (I believe that is faster than individual INSERTs, but I can't find a reference at present) or use it directly via the CSV storage engine, depending on your needs.
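A rough sketch of the LOAD DATA route (the file path and column names are invented; adjust the terminators to however the CSV is written):
LOAD DATA INFILE '/tmp/mined_rows.csv'
INTO TABLE target_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(hash_key, col1, col2, col3);
Note that LOAD DATA only offers REPLACE or IGNORE for duplicate keys, so the summing that your ON DUPLICATE KEY UPDATE does would have to happen before the load, or in a staging pass like the one sketched at the end of this thread.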
Sorry to keep throwing comments at you (last one, probably).
I just found this article, which provides an example of converting a large table from MyISAM to InnoDB. While this isn't exactly what you are doing, the author uses an intermediate MEMORY table and describes going from MEMORY to InnoDB in an efficient way: ordering the table in memory the way InnoDB expects it to be ordered in the end. If you aren't tied to MyISAM, it might be worth a look, since you already have a "correct" MEMORY table built.
I don't use MySQL but SQL Server, and this is the process I use to handle a file of similar size. First I dump the file into a staging table that has no constraints. Then I identify and delete the dups from the staging table. Then I search for existing records that might match and put the ID field into a column in the staging table. Then I update where the ID column is not null and insert where it is null.
One of the reasons I do all the work of getting rid of the dups in the staging table is that it means less impact on the prod table when I run it, and thus it is faster in the end. My whole process runs in less than an hour (and actually does much more than I describe, as I also have to denormalize and clean the data) and affects the production tables for less than 15 minutes of that time. I don't have to worry about adjusting any constraints or dropping indexes or any of that, since I do most of my processing before I hit the prod table.
Consider whether a similar process might work better for you. Also, could you use some sort of bulk import to get the raw data into the staging table (I pull the 22-gig file I have into staging in around 16 minutes) instead of working row by row?
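A hedged MySQL translation of that staging idea (all names are invented, and it assumes the "whatever" in your ON DUPLICATE KEY UPDATE is a running sum): load into an unindexed staging table, collapse duplicates there, then hit the production table in a single pass.
-- Staging table: same four integer columns as the target, but no indexes or constraints.
CREATE TABLE staging (
  hash_key BIGINT UNSIGNED NOT NULL,
  col1 INT NOT NULL,
  col2 INT NOT NULL,
  col3 INT NOT NULL
) ENGINE=MyISAM;
LOAD DATA INFILE '/tmp/mined_rows.csv' INTO TABLE staging
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
(hash_key, col1, col2, col3);
-- Collapse duplicates inside staging, then upsert into production once.
INSERT INTO target_table (hash_key, col1, col2, col3)
SELECT hash_key, SUM(col1), SUM(col2), SUM(col3)
FROM staging
GROUP BY hash_key
ON DUPLICATE KEY UPDATE
  col1 = col1 + VALUES(col1),
  col2 = col2 + VALUES(col2),
  col3 = col3 + VALUES(col3);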