I have a raw text file, with the size of 8.1GB.
The input data is very straightforward:
Lab_A (string), Lab_B (string), Distance (float)
I was trying to load the data into a table, using LOAD DATA INFILE, but the drive ran out of space.
The destination table had the following format:
Id (INT), Lab_A (VARCHAR), Lab_B (VARCHAR), Distance (FLOAT).
With a primary key of Id and an index of (Lab_A + Distance).
Create statement below:
CREATE TABLE `warwick_word_suite`.`distances` (
`id` INT NOT NULL AUTO_INCREMENT,
`label1` VARCHAR(45) NOT NULL,
`label2` VARCHAR(45) NOT NULL,
`distance` FLOAT NOT NULL,
PRIMARY KEY (`id`),
INDEX `LABEL_INDEX` (`label1` ASC, `distance` ASC));
The drive had 50GB and ran out of space. Given 10GB reserved for the system, I am assuming the table was requesting more than 32GB.
My questions are:
How much do InnoDB tables actually take up, relative to the size of the input data?
Do indexed tables take up a lot more space, compared to identical unindexed tables?
Should I simply order a bigger drive for my database server?
EDIT:
I tracked down the data hog to "ibdata1", stored in /var/lib/mysql. This file is taking up 30.3GB.
Double trouble.
InnoDB takes 2x-3x what the raw data takes. This is a crude approximation; there are many factors.
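As a rough worked example of that rule of thumb applied to the 8.1GB file in the question (a sketch only: the function name is made up for illustration, and the 2x-3x factor is the crude approximation above, not a measured value):

```python
def innodb_size_range_gb(raw_gb, low_factor=2.0, high_factor=3.0):
    """Return (low, high) estimated InnoDB footprint in GB for raw_gb of input."""
    return (round(raw_gb * low_factor, 1), round(raw_gb * high_factor, 1))


low, high = innodb_size_range_gb(8.1)
print(f"Estimated InnoDB size: {low}GB to {high}GB")
```

The 30.3GB ibdata1 reported in the question sits above this range, which is a reminder of how crude the factor is.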
ibdata1 is the default place to put the table. Now that the load has grown that file, it will not shrink. This can be a problem. It would have been better to have innodb_file_per_table = ON before trying to load the file. Then the table would have gone into a separate .ibd file, and upon failure, that file would have vanished. As it is, you are low on disk space with no simple way to recover it. (Recovery includes dumping all the other InnoDB tables, stopping mysqld, removing ibdata1, restarting, and then reloading the other tables.)
Back to the ultimate problem... How to use the data. First, can we see a sample (a few rows) of the data? There may be some clues. How many rows are in the table (or lines in the file)?
This may be a case for loading into MyISAM instead of InnoDB; the size for that table will be closer to 8.1GB, plus two indexes, which may add another 5-10GB. Still unpleasantly tight.
Normalizing the lab names would probably be a big gain. Suppose you have 10K labs and 100M distances (every lab to every other lab). Half of those are redundant? Normalizing lab names would save maybe 50 bytes per row -- perhaps half the space?
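To put numbers on that supposition, here is a small arithmetic sketch. The lab count and the 50-bytes-per-row saving are the guesses from the paragraph above, not figures from the real data, and the function name is hypothetical:

```python
LABS = 10_000                 # assumed number of distinct labs
ROWS = LABS * LABS            # every lab paired with every lab: 100M rows
BYTES_SAVED_PER_ROW = 50      # rough per-row saving guessed above


def normalization_savings_gb(rows, bytes_saved_per_row):
    """Space saved by replacing name strings with small ids, in GB (10**9 bytes)."""
    return rows * bytes_saved_per_row / 10**9


# If A-B and B-A carry the same distance, only the unordered pairs are needed.
unordered_pairs = LABS * (LABS - 1) // 2

print(ROWS, unordered_pairs, normalization_savings_gb(ROWS, BYTES_SAVED_PER_ROW))
```

Under those assumptions the saving is a few GB before indexes are counted, and dropping the redundant half of the pairs would roughly halve the row count again.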
Or you could get more disk space.
Ponder which of the above suggestions you want to tackle; then let us know what you still need help with.
I have a production mysql 8 server that has a table for user sessions for a PHP application. I am using innodb_file_per_table. The table is small at any given time (about 300-1000 rows), but rows are constantly being deleted and added. Without interference, the sessions.ibd file slowly grows until it takes up all available disk space. This morning, the table was 300 records and took up over 90GB. This built up over the long term (months).
Running OPTIMIZE TABLE reclaims all of the disk space and brings the table back under 100M. An easy solution would be to make a cron script that runs OPTIMIZE TABLE once a week during our maintenance period. Another proposed suggestion is to convert the table to a MyISAM table, since it doesn't really require any of the features of InnoDB. Both of these solutions should be effective, but they are table specific and don't protect against the general problem. I'd like to know whether there is a solution to this problem that involves database configuration.
Here are the non-default innodb configuration options that we're using:
innodb-flush-log-at-trx-commit = 1
innodb-buffer-pool-size = 24G
innodb-log-file-size = 512M
innodb-buffer-pool-instances = 8
Are there other options we should be using so that the sessions.ibd file doesn't continually grow?
Here is the CREATE TABLE for the table:
CREATE TABLE `sessions` (
`id` varchar(255) NOT NULL DEFAULT '',
`data` mediumtext,
`expires` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
In addition to the inserts and deletes, the data column is updated often.
MyISAM would have a different problem -- fragmentation. After a delete, there is a hole in the table. A later insert fills that hole first; if the new row does not fit, a link is made to the next piece of the record. Eventually fetching a row could involve jumping around most of the table.
If 300 rows take 100MB, a row averages 333KB? That's rather large. And does that number vary a lot? Do you have a lot of text/blob columns? Do they change often? Or is it just delete and add? Would you care to share SHOW CREATE TABLE?
I can't think how a table could grow by a factor of 900 without having at least one multi-GB row added, then deleted. Perhaps with the schema, I could think of some cause and/or workaround.
I have a huge (and growing) MyISAM table (700 million rows = 140GB).
CREATE TABLE `keypairs` (
`ID` char(60) NOT NULL,
`pair` char(60) NOT NULL,
PRIMARY KEY (`ID`)
) ENGINE=MyISAM
The table option was changed to ROW_FORMAT=FIXED, because both columns are always at the full fixed length (60). And yes, sadly, ID is a string and not an INT.
SELECT queries are pretty ok in speed efficiency.
Databases and mysql engine are all 127.0.0.1/localhost. (nothing distant)
Sadly, INSERT is slow as hell. I won't even talk about trying to LOAD DATA with millions of new rows... it takes days.
There won't be any concurrent reads on it. All SELECTs are done one by one, by my local server only. (It is not for clients' use.)
(For info, file sizes: .MYD=88GB, .MYI=53GB, .TMM=400MB)
How could I speed up inserts into that table?
Would it help to PARTITION that huge table? (How, then?)
I heard MyISAM uses a "structure cache" for .frm files, and that a line in the config file helps MySQL keep all the .frm files in memory (in the case of partitioning). Would that help too? (Actually, my .frm file is only 9KB for 700 million rows.)
A string shortening/compression function for the ID string? (Same idea as rainbow tables.) Even if it lowers the maximum number of unique IDs, I will never reach the maximum for 60 chars anyway, so maybe it's an idea? But before creating a new unique ID, I would have to check that the shortened string doesn't already exist in the DB, of course.
Same idea as shortening the ID strings: what about using md5() on the ID? Does a shorter string mean faster in that case?
Sort the incoming data before doing the LOAD. This will improve the cacheability of the PRIMARY KEY(id).
PARTITIONing is unlikely to help, unless there is some useful pattern to ID.
PARTITIONing will not help for single-row insert nor for single-row fetch by ID.
If the strings are not a constant width of 60, you are wasting space and speed by saying CHAR instead of VARCHAR. Change that.
MyISAM's FIXED is useful only if there is a lot of 'churn' (deletes+inserts, and/or updates).
Smaller means more cacheable means less I/O means faster.
The .frm is an encoding of the CREATE TABLE; it is not relevant for this discussion.
A simple compress/zip/whatever will almost always compress text strings longer than 10 characters. And they can be uncompressed, losslessly. What do your strings look like? 60-character English text will shrink to 20-25 bytes.
MD5 is a "digest", not a "compression". You cannot recover the string from its MD5. Anyway, it would take 16 bytes after converting to BINARY(16).
The PRIMARY KEY is a BTree. If ID is somewhat "random", then the 'next' ID (unless the input is sorted) is likely not to be cached. No, the BTree is not rebalanced all the time.
Turning the PRIMARY KEY into a secondary key (after adding an AUTO_INCREMENT) will not speed things up -- it still has to update the BTree with ID in it!
How much RAM do you have? For your situation, and for this LOAD, set MyISAM's key_buffer_size to about 70% of available RAM, but not bigger than the .MYI file. I recommend a big key_buffer because that is where the random accesses are occurring; the .MYD is only being appended to (assuming you have never deleted any rows).
We do need to see your SELECTs to make sure these changes are not destroying performance somewhere else.
Make sure you are using CHARACTER SET latin1 or ascii; utf8 would waste a lot more space with CHAR.
Switching to InnoDB will double, maybe triple, the disk space for the table (data+index). Therefore, it will probably slow down. But a mitigating factor is that the PK is "clustered" with the data, so you are not updating two things for each row inserted. Note that key_buffer_size should be lowered to 10M and innodb_buffer_pool_size should be set to 70% of available RAM.
(My bullet items apply to InnoDB except where MyISAM is specified.)
In using InnoDB, it would be good to try to insert 1000 rows per transaction. Less than that leads to more transaction overhead; more than that leads to overrunning the undo log, causing a different form of slowdown.
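The 1000-rows-per-transaction advice can be sketched on the client side as below. This is only an illustration: the `batches` helper is a made-up name, and the commented-out usage assumes a generic DB-API connection and cursor (`conn`, `cur`) that are not defined here.

```python
from itertools import islice


def batches(rows, batch_size=1000):
    """Yield lists of up to batch_size rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


# Hypothetical usage with a DB-API connection `conn` and cursor `cur`:
# for chunk in batches(all_rows):
#     cur.executemany("INSERT INTO keypairs (ID, pair) VALUES (%s, %s)", chunk)
#     conn.commit()   # one commit per batch of up to 1000 rows

sizes = [len(chunk) for chunk in batches(range(2500))]
print(sizes)
```

The batch size is a parameter so that it can be tuned against the transaction-overhead and undo-log trade-off described above.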
Hex ID
Since ID is always 60 hex digits, declare it to be BINARY(30) and pack them via UNHEX(...) and fetch via HEX(ID). Test via WHERE ID = UNHEX(...). That will shrink the data about 25%, and MyISAM's PK by about 40%. (25% overall for InnoDB.)
To do just the conversion to BINARY(30):
CREATE TABLE new (
ID BINARY(30) NOT NULL,
`pair` char(60) NOT NULL
-- adding the PK later is faster for MyISAM
) ENGINE=MyISAM;
INSERT INTO new
SELECT UNHEX(ID),
pair
FROM keypairs;
ALTER TABLE new ADD
PRIMARY KEY (`ID`); -- For InnoDB, I would do differently
RENAME TABLE keypairs TO old,
new TO keypairs;
DROP TABLE old;
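To see why the packed form is half the size, the same packing can be mimicked outside MySQL. This is a sketch in Python, not what the server does internally; `pack_id` and `unpack_id` are hypothetical names, and the sample ID is made up:

```python
def pack_id(hex_id: str) -> bytes:
    """Similar in spirit to UNHEX(): 60 hex digits become 30 raw bytes."""
    if len(hex_id) != 60:
        raise ValueError("expected exactly 60 hex digits")
    return bytes.fromhex(hex_id)


def unpack_id(raw: bytes) -> str:
    """Similar in spirit to HEX(), except that Python returns lowercase digits."""
    return raw.hex()


sample = "0123456789abcdef" * 3 + "0123456789ab"   # made-up 60-digit ID
packed = pack_id(sample)
print(len(sample), len(packed), unpack_id(packed) == sample)
```

If the application compares the round-tripped string against stored text, keep the case difference between MySQL's HEX() and Python's `hex()` in mind.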
Tiny RAM
With only 2GB of RAM, a MyISAM-only dataset should use something like key_buffer_size=300M and innodb_buffer_pool_size=0. For InnoDB-only: key_buffer_size=10M and innodb_buffer_pool_size=500M. Since ID is probably some kind of digest, it will be very random. The small cache and the random key combine to mean that virtually every insert will involve a disk I/O. My first estimate would be more like 30 hours to insert 10M rows. What kind of drives do you have? SSDs would make a big difference if you don't already have such.
The other thing to do to speed up the INSERTs is to sort by ID before starting the LOAD. But that gets tricky with the UNHEX. Here's what I recommend.
1. Create a MyISAM table, tmp, with ID BINARY(30) and pair, but no indexes. (Don't worry about key_buffer_size; it won't be used.)
2. LOAD the data into tmp.
3. ALTER TABLE tmp ORDER BY ID; This will sort the table. There is still no index. I think, without proof, that this will be a filesort, which is much faster than "repair by key buffer" for this case.
4. INSERT INTO keypairs SELECT * FROM tmp; This will maximize the caching by feeding rows to keypairs in ID order.
Again, I have carefully spelled out things so that it works well regardless of which Engine keypairs is. I expect step 3 or 4 to take the longest, but I don't know which.
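If you would rather sort the incoming file before it reaches MySQL, the ordering the steps above aim for can be sketched like this. It is an in-memory illustration only (for millions of rows an external sort would be needed), and `sorted_for_load` and the sample rows are made up:

```python
def sorted_for_load(rows):
    """Sort (hex_id, pair) rows by the raw bytes of the ID.

    BINARY columns compare byte by byte, so sorting on the unpacked
    bytes is intended to match the order of the packed ID column.
    """
    return sorted(rows, key=lambda row: bytes.fromhex(row[0]))


rows = [("ff" * 30, "c"), ("00" * 30, "a"), ("7A" * 30, "b")]
print([pair for _, pair in sorted_for_load(rows)])
```

Sorting on the decoded bytes rather than on the hex text avoids surprises when the input mixes upper and lower case digits.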
Optimizing a table requires that you optimize for specific queries. You can't determine the best optimization strategy unless you have specific queries in mind. Any optimization improves one type of query at the expense of other types of queries.
For example, if your query is SELECT SUM(pair) FROM keypairs (a query that would have to scan the whole table anyway), partitioning won't help, and just adds overhead.
If we assume your typical query is inserting or selecting one keypair at a time by its primary key, then yes, partitioning can help a lot. It all depends on whether the optimizer can tell that your query will find its data in a narrow subset of partitions (ideally one partition).
Also make sure to tune MyISAM. There aren't many tuning options:
Allocate key_buffer_size as high as you can spare to cache your indexes. Though I haven't ever tried anything higher than about 10GB, and I can't guarantee that MyISAM key buffers are stable at 53GB (the size of your MYI file).
Pre-load the key buffers: https://dev.mysql.com/doc/refman/5.7/en/cache-index.html
Size read_buffer_size and read_rnd_buffer_size appropriately given the queries you run. I can't give a specific value here, you should test different values with your queries.
Size bulk_insert_buffer_size to something large if you want to speed up LOAD DATA INFILE. It's 8MB by default, I'd try at least 256MB. I haven't experimented with that setting, so I can't speak from experience.
I try not to use MyISAM at all. MySQL is definitely trying to deprecate its use.
...is there a mysql command to ALTER TABLE add INT ID increment column automatically?
Yes, see my answer to https://stackoverflow.com/a/251630/20860
First, your primary key is not incrementable.
Which means, roughly: at every insert the index has to be rebalanced.
No wonder it goes slowpoke on a table of such a size.
And with such an engine...
So, to the second: what's the point of keeping that MyISAM old junk?
Like, for example, you don't mind losing a row or two (or -teen) in case of an accident? And so on, even setting aside that the current MySQL maintainer (Oracle Corp) explicitly discourages use of MyISAM.
So, here are possible solutions:
1) Switch to InnoDB;
2) If you can't surrender the char ID, then:
Add an auto-increment numeric key and make it the primary key - then the index would be clustered and the cost of an insert would drop significantly;
Turn your current key into a secondary index;
3) In case you can - it's obvious
I have a table with more than 300,000 records, of size approximately 1.5 GB
In that table I have three varchar(5000) fields, the rest are small fields.
I issued an update, setting those three fields to ''.
After a shrink (database and files) the database uses almost the same space as before...
DBCC SHRINKDATABASE(N'DataBase' )
DBCC SHRINKFILE (N'DataBase' , 1757)
DBCC SHRINKFILE (N'DataBase_log' , 344)
Any ideas on how to reclaim that disk space?
Essentially, you have to "move" the contents of the table from one place on the hard drive to another. When so moved, SQL will "repack" the contents of the pages efficiently. Just replacing 5000 bytes of data with 3 (or 0 and a flipped null bitmask) will not cause SQL to revise or rewrite the contents of the table's pages.
If the table has a clustered index, just reindexing it (ALTER INDEX... REBUILD...) will do the trick.
If the table does not have a clustered index, you can either create one and then drop it, or SELECT...INTO... a new table, drop the old table, and rename the new one to the original name.
Just because you set the column to nil doesn't mean the database will reorg the table. The updated record will still fit on the same page it fit on before (the amount of free space on the page will increase).
Also, you do know, don't you, that varchar(5000) doesn't mean that it takes up 5000 octets? It's variable length -- a two-octet length prefix containing the data length of the field, followed by the data octets. Setting a varchar(5000) column in a row to 'foobar' will require 8 octets of space (2+6).
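The 2+6 arithmetic generalizes as in this small sketch. It assumes a single-byte encoding and ignores any per-row or per-page overhead; the function name is made up for illustration:

```python
def varchar_storage_octets(value: str, encoding: str = "latin-1") -> int:
    """Two-octet length prefix plus the encoded data octets."""
    return 2 + len(value.encode(encoding))


print(varchar_storage_octets("foobar"))   # 2 + 6
print(varchar_storage_octets(""))         # just the length prefix
```

So three emptied varchar(5000) columns cost only a handful of octets per row; the space they used to occupy stays as free space on the page until the table is rebuilt.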
Re-build your indices, including the clustering index.
If you don't have a clustering index, add one. That will force a reorg of the table. Now drop the clustering index.
Now when you shrink the datafile, you should reclaim some disk space.
I've just had to set those fields to null, issue the shrink, and then set them to ''
and the db went from 1.5 GB to 115 MB
pretty strange...
--
in fact, setting those fields to nullable -that means recreating the whole table- did the trick
I am trying to get a better understanding about insertion speed and performance patterns in mysql for a custom product. I have two tables to which I keep appending new rows. The two tables are defined as follows:
CREATE TABLE events (
added_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
id BINARY(16) NOT NULL,
body MEDIUMBLOB,
UNIQUE KEY (id)) ENGINE InnoDB;
CREATE TABLE index_fpid (
fpid VARCHAR(255) NOT NULL,
event_id BINARY(16) NOT NULL UNIQUE,
PRIMARY KEY (fpid, event_id)) ENGINE InnoDB;
And I keep inserting new objects to both tables (for each new object, I insert the relevant information to both tables in one transaction). At first, I get around 600 insertions / sec, but after ~ 30000 rows, I get a significant slowdown (around 200 insertions/sec), and then a slower, but still noticeable, slowdown.
I can see that as the table grows, the IO wait numbers get higher and higher. My first thought was memory taken by the index, but those are done on a VM which has 768 Mb, and is dedicated to this task alone (2/3 of memory are unused). Also, I have a hard time seeing 30000 rows taking so much memory, even more so just the indexes (the whole mysql data dir < 100 Mb anyway). To confirm this, I allocated very little memory to the VM (64 Mb), and the slowdown pattern is almost identical (i.e. slowdown appears after the same numbers of insertions), so I suspect some configuration issues, especially since I am relatively new to databases.
The pattern looks as follows:
I have a self-contained python script which reproduces the issue, that I can make available if that's helpful.
Configuration:
Ubuntu 10.04, 32 bits running on KVM, 760 Mb allocated to it.
Mysql 5.1, out of the box configuration with separate files for tables
[EDIT]
Thank you very much to Eric Holmberg, he nailed it. Here are the graphs after fixing the innodb_buffer_pool_size to a reasonable value:
Edit your /etc/mysql/my.cnf file and make sure you allocate enough memory to the InnoDB buffer pool. If this is a dedicated server, you could probably use up to 80% of your system memory.
# Provide a buffer pool for InnoDB - up to 80% of memory for a dedicated database server
innodb_buffer_pool_size=614M
The primary keys are B Trees so inserts will always take O(logN) time and once you run out of cache, they will start swapping like mad. When this happens, you will probably want to partition the data to keep your insertion speed up. See http://dev.mysql.com/doc/refman/5.1/en/partitioning.html for more info on partitioning.
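For what it's worth, the 614M figure is just 80% of the 768 MB VM, and the O(logN) point can be made concrete with a rough height estimate. Both functions below are illustrative sketches with made-up names; in particular the fanout of 100 is a guess for the sake of the arithmetic, not an InnoDB constant:

```python
import math


def buffer_pool_mb(total_ram_mb, fraction=0.8):
    """Buffer pool size in whole MB for a dedicated server (rounded down)."""
    return int(total_ram_mb * fraction)


def btree_levels(rows, fanout=100):
    """Approximate B-tree height; fanout=100 is an illustrative guess."""
    return max(1, math.ceil(math.log(rows, fanout)))


print(buffer_pool_mb(768))
print(btree_levels(30_000), btree_levels(30_000_000))
```

The height grows very slowly with the row count, so the slowdown comes less from the depth of the tree than from whether the pages touched on the way down are still in the buffer pool.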
Good luck!
Your indexes may just need to be analyzed and optimized during the insert, they gradually get out of shape as you go along. The other option of course is to disable indexes entirely when you're inserting and rebuild them later which should give more consistent performance.
Great link about insert speed.
ANALYZE. OPTIMIZE
Verifying that the insert doesn't violate a key constraint takes some time, and that time grows as the table gets larger. If you're interested in flat out performance, using LOAD DATA INFILE will improve your insert speed considerably.
Does anyone know why I get an overhead of 131.0 MiB on a newly created table (zero rows)?
I'm using phpMyAdmin, and the code of my script is
CREATE TABLE IF NOT EXISTS `mydb`.`mytable` (
`idRol` INT NOT NULL AUTO_INCREMENT ,
`Rol` VARCHAR(45) NOT NULL ,
PRIMARY KEY (`idRol`) )
ENGINE = InnoDB;
thanks in advance.
InnoDB uses a shared tablespace. That means that by default all the tables, regardless of database, are stored in a single file in the filesystem. This differs from, for example, MyISAM, which stores every table in its own files.
The behaviour of InnoDB can be changed, although I don't think it's really necessary in this case. See Using Per-Table Tablespaces.
The overhead is probably the space left by deleted rows, and InnoDB will reuse it when you insert new data. It's nothing to be concerned about.
It might be because mysql generated an index on 'idRol'
Storing an index takes some space, but I am not sure if this is the reason. It's only a guess. I'm not a DBA.