MySQL very large file import

I have a very large MySQL INSERT script that is 70 GB and contains 22,288,088,954 records to insert.
What is the best (fastest) way to import this file?
I have already set this up:
innodb_buffer_pool_size = 4G
innodb_log_buffer_size = 256M
innodb_log_file_size = 1G
innodb_write_io_threads = 16
innodb_flush_log_at_trx_commit = 0
I am using the MySQL command-line client to execute the file, but it has now been running for almost 5 days, and when I check how many records have been inserted so far, there are only 4,371,106.
I use this SQL to find the total number of inserted records:
SELECT SUM(TABLE_ROWS)
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA LIKE 'mydatabase'
MySQL is installed on a Windows 10 Pro PC with an i5 CPU, 32 GB RAM, and an SSD.
Can anybody suggest a faster way to import this script? If only 4,000,000 of 22,288,088,954 records go in over 5 days, it will take a very long time to finish.
Thank you in advance.

First, I don't believe the numbers. 22 billion rows in 70 GB is just over 3 bytes per row. In a multi-row INSERT, that barely covers the parentheses and the separating comma, leaving roughly one character for the data itself.
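Also, TABLE_ROWS in INFORMATION_SCHEMA is only an estimate for InnoDB tables; to verify the real count for a single table you need an exact count, for example (table name is a placeholder):
SELECT COUNT(*) FROM mydatabase.mytable;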
Second, you no doubt have a problem with the transaction log. Are you committing after every insert, or only every few inserts? For optimal performance you should probably commit after every 1,000 to 100,000 rows, so the log doesn't get too big.
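For example, a minimal sketch of that batching pattern (table and column names are placeholders):
SET autocommit = 0;
INSERT INTO mytable (id, val) VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO mytable (id, val) VALUES (4, 'd'), (5, 'e'), (6, 'f');
COMMIT;  -- one commit per batch of 1,000 - 100,000 rows, not one per statement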
Third, the best way to insert the data is not via INSERT statements at all. You should be using LOAD DATA INFILE. In less than five days you should be able to write a script that converts the INSERT statements into an input data file, and then just load that. Better yet, go back to the original data and extract it in a more suitable format.
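A hedged sketch of that route, with hypothetical file, table, and column names (LOCAL requires local_infile to be enabled):
SET unique_checks = 0;
SET foreign_key_checks = 0;
LOAD DATA LOCAL INFILE 'C:/dump/mydata.csv'
  INTO TABLE mytable
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n'
  (col1, col2, col3);
SET unique_checks = 1;
SET foreign_key_checks = 1;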

Related

Optimizing Performance of Bulk Update of Existing MySQL Data

I'm trying to perform a bulk update of around 200,000 existing MySQL rows. More specifically, I need to update eight empty LONGBLOB fields in these rows, each with a ~0.5 MB file (LONGBLOB is used because there are some special cases where significantly larger files are stored; those are not part of this bulk update). The files that need to be inserted are stored locally on disk.
I'm using a MATLAB script I wrote to loop through each of the folders that store these files, read the files in and convert them to hexadecimal representations, then execute an UPDATE query to update the eight columns with the eight files for each row.
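(The UPDATE statements have roughly this shape; in this hypothetical sketch the table name, column names, and id are illustrative, and the real hex literals would be ~0.5 MB each:)
UPDATE measurements
SET file_1 = x'89504E47',
    file_2 = x'FFD8FFE0',
    file_3 = x'25504446'  -- columns file_4 through file_8 follow the same pattern
WHERE id = 12345;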
Initially, things ran fairly quickly; however, I noticed that after a couple of thousand completed queries things really started to slow down. I did a bit of research on optimizing MySQL and InnoDB system variables and increased my innodb_buffer_pool_size to 25G and innodb_buffer_pool_instances to 25.
After this modification things sped up again, but slowed down after another couple of thousand queries. I did a little more research and tried adjusting some other variables, such as innodb_log_buffer_size and innodb_log_file_size, increasing both to 100M just to see what would happen. I also set innodb_write_io_threads and innodb_read_io_threads to 16, as I am running all of this on a fairly high-end server with 32 GB of RAM. Unfortunately, these modifications didn't help much, and now I'm stuck with queries taking a few minutes each to complete.
Does anyone have any suggestions or ideas on how I can optimize this process and have it run as fast as possible?
Thanks,
Joe
innodb_buffer_pool_instances=8 will likely serve your requirements with less overhead.
innodb_log_buffer_size=10M means changes are buffered in memory and flushed to the InnoDB log file once 10M has accumulated. A ratio of roughly buffer_size * 10 = log_file_size is reasonable.
When innodb_log_buffer_size is the same as innodb_log_file_size, you effectively have NO buffering. Make the buffer size much smaller than the log file size.
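Settings placed in my.ini only take effect after a restart, so it is worth confirming what the running server is actually using, for example:
SHOW VARIABLES LIKE 'innodb_buffer_pool%';
SHOW VARIABLES LIKE 'innodb_log%';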

Issues with wikipedia dump table pagelinks

I downloaded the enwiki-latest-pagelinks.sql.gz dump from dumps.wikimedia.org/enwiki/latest/.
I unpacked the file; its uncompressed size is 37 GB.
The table structure is this:
SHOW CREATE TABLE wp_dump.pagelinks;
CREATE TABLE `pagelinks` (
`pl_from` int(8) unsigned NOT NULL DEFAULT '0',
`pl_namespace` int(11) NOT NULL DEFAULT '0',
`pl_title` varbinary(255) NOT NULL DEFAULT '',
`pl_from_namespace` int(11) NOT NULL DEFAULT '0',
UNIQUE KEY `pl_from` (`pl_from`,`pl_namespace`,`pl_title`),
KEY `pl_namespace` (`pl_namespace`,`pl_title`,`pl_from`),
KEY `pl_backlinks_namespace` (`pl_from_namespace`,`pl_namespace`,`pl_title`,`pl_from`)
) ENGINE=InnoDB DEFAULT CHARSET=binary
I imported the table into a new, empty database:
mysql -D wp_dump -u root -p < enwiki-latest-pagelinks.sql
The computer I am running the task on has 16 GB of RAM and the MySQL database is located on an SSD, so I assumed that despite the table's size the import would not take too long.
However, the task has now been running for over a day and is still not finished. There are no other processes accessing MySQL and no other workload on the computer.
The database file itself is now 79 GB.
ls -lh
-rw-r----- 1 mysql mysql 65 May 11 17:40 db.opt
-rw-r----- 1 mysql mysql 8,6K May 12 07:06 pagelinks.frm
-rw-r----- 1 mysql mysql 79G May 13 16:59 pagelinks.ibd
The table now has over 500 million rows.
SELECT table_name, table_rows FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'wp_dump';
+------------+------------+
| table_name | table_rows |
+------------+------------+
| pagelinks | 520919860 |
+------------+------------+
I am wondering:
Is the imported enwiki-latest-pagelinks table really over 79 GB?
Does pagelinks really contain over 500 million rows?
Does it really take that long to import the pagelinks table?
Can you provide some metrics on the expected table size and row count, please?
Update, 14 May 2017:
Insert still running; the pagelinks.ibd file is now 130 GB; the number of rows is now almost 700 million.
Update, 16 May 2017:
Insert still running; the pagelinks.ibd file is now 204 GB; the number of rows is now over 1.2 billion.
I calculated the rows inserted per second over the last two days:
rows/sec = 3236
And: there are many thousands of rows per INSERT statement in the SQL script (head -41 enwiki-latest-pagelinks.sql | tail -1 | grep -o "(" | wc -l gives 30471).
So, my follow-up / modified questions:
Are the number of rows and the .ibd file size to be expected, given the SQL file size of 37 GB and the table structure (as listed above)?
Is rows/sec = 3236 a good value (meaning it simply takes a few days to insert the table)?
What may be the limiting speed factor / how can I speed up the import?
Disable the indexes (and calculate them after the insert)?
Optimize transactions (commit (nothing set in script) / autocommit (now ON))?
Optimize variable settings (e.g. innodb_buffer_pool_size, now 134217728)?
@Sim Betren: I am currently importing the same table and I can get about 7700 rows/s, which means about 600,000,000 rows a day. Probably the most important thing is to get the right settings on InnoDB:
https://dba.stackexchange.com/questions/83125/mysql-any-way-to-import-a-huge-32-gb-sql-dump-faster
innodb_buffer_pool_size = 4G
innodb_log_buffer_size = 256M
innodb_log_file_size = 1G
innodb_write_io_threads = 16
innodb_flush_log_at_trx_commit = 0
Those settings work well. From what I have read and tried, InnoDB loves high memory settings. Ideally one would use a 16 GB or even 32 GB machine and increase these settings even more. But I got 7700 rows/s on a modest setup that is almost 10 years old:
Intel Q6700 quad
8 Gb DDR2 memory
I combined that 10-year-old hardware with a 2017-model 500 GB SSD, which is dedicated to the job and handles both the reading and the writing. The reason for using the older hardware is that the SSD is the most important part of the setup (because of IOPS); plus, by using older hardware I saved a bit of money. However, the hardware is limited to 8 GB of DDR2. A newer dedicated machine with 32 GB or 64 GB of internal memory could really fly, I reckon.
Software setup:
Linux Mint 64bit
MySQL Server 5.7.18 for Ubuntu
MySQL Workbench for importing
I also tried this on Windows 10 and the speed is almost the same on both. So you could try Windows too.
Note: I did try changing the engine to MyISAM. MyISAM can be pretty fast too, also around 8000 rows/sec or more, but the import always ended up corrupted for some reason. So I would stick with InnoDB.
Update 17-06-2017:
Finished the import. The table "pagelinks" is about 214 GB with 1200 million rows; about 112 GB is raw data and 102 GB is indexes. The original uncompressed file was about 37 GB.
It took about 2 days and 6 hours to import, an average speed of 5350 rows/second. With high-end equipment (huge memory, preferably 64 GB or more) and optimal settings it could probably be done faster, but I let it run on a dedicated machine 24/7 and I wasn't in a hurry, so 2 days seems OK.
Update 18-06-2017:
Also imported "page.sql" because this contains the names connected to the ID's. The uncompressed file is about 5Gb, import took 1 hour. Which seems quick: the pagelink file is about 37Gb which is 7x bigger than "page.sql". Yet takes 50x longer to import. So, there are a few reasons why "pagelinks" takes so long: (A) probably because it doesn't fit in memory (B) The table structure, many data per insert (C) Settings. But most likely it's memory.
Conclusion: try get a PC with 32Gb or 64Gb internal memory. Maybe even more. And use SSD's that can keep up with that memory, 500Gb or more. The SSD is more important than memory so try that first.
@Sim Betren:
I want to open a completely new answer, since I have discovered a new solution. Splitting up the file is probably the best answer. As discussed in the other answer, InnoDB works best when the entire model fits in memory; the delays start when it needs to swap to disk. The pagelinks file is 37 GB, which is simply too big for most machines to fit easily into memory. Maybe a $1000+ dedicated machine with endless memory can do it, but most desktops can't. So here is what you can do:
The plan is to split the file. The first thing to do is to separate the SQL structure from the data.
There are probably better ways to do that, but a program I found was this:
SqlDumpSplitter2
That dump splitter program may be old, but it worked on pagelinks. It is Windows-only, though. I simply told it to split the unpacked 37 GB file into 37 chunks of 1 GB and it dutifully did. I checked the data and it seems to be working. You could also use 74 chunks of 500 MB.
The import of each file takes maybe 10 to 20 minutes per GB.
Total time: about 1 to 2 hours for splitting the 37 GB file and about 6 to 12 hours for importing. That easily beats the previous answer I gave.
When importing, use the same big-data settings as in the previous answer, and try to find a machine with plenty of memory, 16 GB or 32 GB preferred.
What is most important here is that it doesn't really matter much how you split the file. Just split it any way you can, then build it up by re-creating the structure and the data separately. That way the import could be down from 2 days to maybe as little as a few hours. Given a big dedicated machine it can probably be done in just 1 to 6 hours.
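When importing the chunks from the mysql command-line client, the relaxed checks can be set once per session and each chunk sourced in turn (the chunk file names here are hypothetical):
SET unique_checks = 0;
SET foreign_key_checks = 0;
source pagelinks_part_01.sql
source pagelinks_part_02.sql
-- ... remaining chunks ...
SET unique_checks = 1;
SET foreign_key_checks = 1;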
37GB of data --> 79GB of InnoDB table seems reasonable...
Title: 2 quotes and 1 comma --> 1 byte for length
Ints: several bytes, plus comma --> 4 bytes for INT (regardless of the (...) after INT); see MEDIUMINT.
20-30 bytes overhead per row
20-40% overhead for BTrees.
UNIQUE index becomes PRIMARY KEY and clusters with data --> very little overhead.
Other two indexes: each is virtually the same size as the data. This more than accounts for the increased size.
Adding it all together, I would expect the table to be more than 120GB. So, there are probably some details missing. A guess: The dump is one row per INSERT, instead of the less verbose many-rows-per-INSERT.
As for performance, it all depends on the SELECTs. Set innodb_buffer_pool_size to somewhere around 11G. This may work efficiently enough for caching the 79G.
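If the server is MySQL 5.7.5 or later, the buffer pool can also be resized at runtime; a minimal sketch (put the value in my.cnf as well so it survives a restart):
SET GLOBAL innodb_buffer_pool_size = 11 * 1024 * 1024 * 1024;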
More
Change UNIQUE to PRIMARY, for clarity and because InnoDB really needs a PK (see the sketch after this list).
Check the source data. Is it in (pl_from,pl_namespace,pl_title) order? If not, can you sort the file before loading? If you can, that, alone, should significantly help the speed.
128MB for the buffer_pool is also significantly hampering progress.
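A hedged sketch of the UNIQUE-to-PRIMARY change mentioned above (on an already-loaded table this rebuilds the whole table, so do it before loading if possible):
ALTER TABLE pagelinks
  DROP INDEX pl_from,
  ADD PRIMARY KEY (pl_from, pl_namespace, pl_title);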

importing table with phpMyAdmin takes too long

I've imported a MySQL table stored in a .sql file using phpMyAdmin on localhost. The table has 14,000 records (simple data, 5 fields only) and it took almost 10 minutes. Is this normal? I'm running a laptop with Windows 8 and a quad-core i7, and my XAMPP seems to be configured properly.
Thanks
Your hard drive is the limiting factor here. Having a single INSERT per row means your insert rate is bound by your hard drive's IOPS (I/O operations per second).
Bulk inserting reduces the required IOPS and increases the MB/s throughput, which is what you want in this case.
So rewriting it like
INSERT INTO table VALUES (1,2,3,4),(1,2,3,4)
with comma-separated rows will give a huge boost.
Putting in a drive with higher IOPS will also speed things up if the rewritten query is still slow.
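A hedged sketch of that rewrite with a hypothetical table, also wrapping the batches in a single transaction to cut down on commits:
START TRANSACTION;
INSERT INTO customers (id, name, city, score) VALUES
  (1, 'Ann', 'Oslo', 10),
  (2, 'Bob', 'Bern', 20),
  (3, 'Cai', 'Kiev', 30);
INSERT INTO customers (id, name, city, score) VALUES
  (4, 'Dan', 'Lima', 40),
  (5, 'Eve', 'Rome', 50);
COMMIT;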

MySQL fetch time optimization

I have a table with 2 million records, but it will grow much larger soon. Basically, this table contains points of interest from images together with their descriptors. When I execute a query that selects points spatially near the query point, the total execution time is too long; more precisely, Duration / Fetch = 0.484 sec / 27.441 sec. The query is quite simple and returns only ~17,000 rows.
My query is:
SELECT fp.fingerprint_id, fp.coord_x, fp.coord_y, fp.angle,
fp.desc1, fp.desc2, fp.desc3, fp.desc4, fp.desc5, fp.desc6, fp.desc7, fp.desc8, fp.desc9, fp.desc10,
fp.desc11, fp.desc12, fp.desc13, fp.desc14, fp.desc15, fp.desc16, fp.desc17, fp.desc18, fp.desc19,
fp.desc20, fp.desc21, fp.desc22, fp.desc23, fp.desc24, fp.desc25, fp.desc26, fp.desc27, fp.desc28,
fp.desc29, fp.desc30, fp.desc31, fp.desc32
FROM fingerprint fp
WHERE
fp.is_strong_point = 1 AND
(coord_x BETWEEN 193-40 AND 193+40) AND (coord_y BETWEEN 49-15 AND 49+15 )
LIMIT 1,1000000;
This is what I've done:
I've tried changing key_buffer_size in my.ini, but didn't see much change.
In addition, I've tried indexing coord_x and coord_y, but the query time got slower.
The table is partitioned by range on the coord_x field, which gave better results.
How can I reduce the fetch time? Is it possible to get it down to milliseconds?
I faced a slow-fetch issue too (MySQL, InnoDB).
Finally I found that innodb_buffer_pool_size was set to 8 MB by default on my system, which is not enough to handle the query. After increasing it to 1 GB, performance looks fine:
Duration / Fetch
353 row(s) returned 34.422 sec / 125.797 sec (8MB innodb buffer)
353 row(s) returned 0.500 sec / 1.297 sec (1GB innodb buffer)
UPDATE:
To change innodb_buffer_pool_size, add this to your my.cnf:
innodb_buffer_pool_size=1G
and restart MySQL for the change to take effect.
Reference: How to change value for innodb_buffer_pool_size in MySQL on Mac OS?
If I am right, the query itself is fast; what is slow is fetching the data from storage. It takes 27 seconds to load the ~17,000 result rows.
It looks like you may be using the wrong storage engine. Try switching the table from one engine to another.
For maximum speed you can use the MEMORY engine. The only drawback is that you have to keep a copy of the table in another engine if you need to make durable changes to it, and after any change you have to reload the differences or the entire table.
You would also have to make a script that runs when you restart your server, so that the memory table is loaded on MySQL startup.
See the MySQL documentation for the MEMORY engine.
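A hedged sketch of that idea, assuming the descriptor columns are numeric (the MEMORY engine does not support BLOB/TEXT) and that the copy fits within max_heap_table_size:
SET max_heap_table_size = 2 * 1024 * 1024 * 1024;  -- allow a larger in-memory table for this session
CREATE TABLE fingerprint_mem ENGINE=MEMORY AS
  SELECT * FROM fingerprint;
-- query fingerprint_mem instead of fingerprint; reload it after any change to the base table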
Increasing my buffer size made my query faster. But you need to open the my.ini file with Notepad++, because it may show garbled hex data if you open it in plain Notepad.
I found a fix: just disable AVG (or whatever antivirus you run) on your system and then restart your Workbench.
Make sure that the following line is not present in your Hibernate configuration (hibernate.cfg.xml):
<property name="hbm2ddl.auto">create</property>
If it is there, remove it; with hbm2ddl.auto set to create, the schema is dropped and recreated on every startup.

improving performance of mysql load data infile

I'm trying to bulk load around 12m records into an InnoDB table in a (local) MySQL instance using LOAD DATA INFILE (from CSV), and I'm finding that it takes a very long time to complete.
The primary key type is UUID and the keys are unsorted in the data files.
I've split the data file into files containing 100,000 records each and import them as:
mysql -e 'ALTER TABLE customer DISABLE KEYS;'
for file in *.csv
do
    mysql -e "SET sql_log_bin=0; SET FOREIGN_KEY_CHECKS=0; SET UNIQUE_CHECKS=0;
    SET AUTOCOMMIT=0; LOAD DATA INFILE '${file}' INTO TABLE customer
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'; COMMIT;"
done
This works fine for the first few hundred thousand records, but then the insert time for each subsequent load seems to keep growing (from around 7 seconds to around 2 minutes per load before I killed it).
I'm running on a machine with 8GB RAM and have set the InnoDB parameters to:
innodb_buffer_pool_size =1024M
innodb_additional_mem_pool_size =512M
innodb_log_file_size = 256M
innodb_log_buffer_size = 256M
I've also tried loading a single CSV containing all the rows, with no luck; it ran for over 2 hours before I killed it.
Is there anything else that could speed this up? This seems like an excessive amount of time to load only 12m records.
If you know the data is "clean", then you can drop indexes on the affected tables prior to the import and then re-add them after it is complete.
Otherwise, each record causes an index-recalc, and if you have a bunch of indexes, this can REALLY slow things down.
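A hedged sketch with a hypothetical secondary index name (note that ALTER TABLE ... DISABLE KEYS, as used in the script above, only affects MyISAM tables, not InnoDB):
ALTER TABLE customer DROP INDEX idx_customer_email;
-- run all the LOAD DATA imports here
ALTER TABLE customer ADD INDEX idx_customer_email (email);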
It's always hard to tell the cause of performance issues, but here are my 2 cents:
Your primary key is a UUID, so the keys are randomly distributed, which makes the index hard to maintain. Keys are stored by range in blocks on disk, so random UUIDs arriving one after another make the system read and write blocks all over the file without leveraging the cache. I don't know whether you can change the key, but you could try sorting the UUIDs in the input file and see if that helps.
FYI, to understand this issue better I would take a look at this blog post and maybe read the book High Performance MySQL; it has a nice chapter about the InnoDB clustered index.
Good Luck!