MySQL: Insert Huge SQL Files of GB in Size

I'm trying to create a copy of the Wikipedia database (around 50 GB), but I'm having problems with the largest SQL files.
I've split the multi-GB files into chunks of about 300 MB using the Linux split utility, e.g.
split -d -l 50 ../enwiki-20070908-page page.input.
On average, a 300 MB file takes 3 hours to import on my server. I'm running Ubuntu 12.04 Server and MySQL 5.5.
I'm importing like this:
mysql -u username -ppassword database < category.sql
Note: these files consist of INSERT statements; they are not CSV files.
Wikipedia offers database dumps for download, so everybody can create a copy of Wikipedia.
You can find example files here: Wikipedia Dumps
I think the import is slow because of the settings for my MySQL Server, but I don't know what I should change. I'm using the standard Ubuntu MySQL config on a machine with a decent processor and 2GB RAM. Could someone help me out with a suitable configuration for my system?
I've tried setting innodb_buffer_pool_size to 1 GB, but to no avail.

Since you have less than 50GB of memory (so you can't buffer the entire database in memory), the bottleneck is the write speed of your disk subsystem.
Tricks to speed up imports (a sketch putting them together follows below):
MyISAM is not transactional, so it is much faster for single-threaded inserts. Try loading into MyISAM, then ALTER the table to InnoDB.
Use ALTER TABLE .. DISABLE KEYS to avoid updating indexes row by row (MyISAM only).
Set bulk_insert_buffer_size larger than your insert size (MyISAM only).
Set unique_checks = 0 so that unique constraints are not checked.
For more, see Bulk Data Loading for InnoDB Tables in MySQL Manual.
Note: If the original table has foreign key constraints, using MyISAM as an intermediate format is a bad idea.
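A rough sketch putting these tips together, assuming a table named page with no foreign keys and a split chunk named page.input.00 (both names are placeholders, not from the question); run it inside the mysql client:
SET SESSION bulk_insert_buffer_size = 256 * 1024 * 1024;  -- larger than one chunk's inserts
SET SESSION unique_checks = 0;                            -- skip unique-constraint checks
SET SESSION foreign_key_checks = 0;                       -- skip FK checks during the load
ALTER TABLE page ENGINE=MyISAM;    -- load into MyISAM first
ALTER TABLE page DISABLE KEYS;     -- defer non-unique index maintenance (MyISAM only)
SOURCE page.input.00;              -- run the INSERT chunk(s)
ALTER TABLE page ENABLE KEYS;      -- rebuild the indexes in one pass
ALTER TABLE page ENGINE=InnoDB;    -- convert back to InnoDB once the data is in
SET SESSION unique_checks = 1;
SET SESSION foreign_key_checks = 1;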

Use MyISAM, which is usually much faster than InnoDB, if your database isn't transaction oriented. Have you looked into any table partitioning/sharding techniques (see the sketch below)?
Converting a huge MyISAM table to InnoDB will again run into performance issues, so I am not sure I would do that. But disabling and re-enabling keys could be of help...
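A minimal sketch of range partitioning, purely for illustration; the table and column names are made up, not taken from the question:
CREATE TABLE page (
  page_id INT NOT NULL,
  page_title VARBINARY(255),
  PRIMARY KEY (page_id)
) ENGINE=MyISAM
PARTITION BY RANGE (page_id) (
  PARTITION p0 VALUES LESS THAN (10000000),
  PARTITION p1 VALUES LESS THAN (20000000),
  PARTITION p2 VALUES LESS THAN MAXVALUE
);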

Related

InnoDB indexes before and after importing

I'm trying to import a large SQL file that was generated by mysqldump for an InnoDB table, but it is taking a very long time even after adjusting some parameters in my.cnf and disabling AUTOCOMMIT (as well as FOREIGN_KEY_CHECKS and UNIQUE_CHECKS, although the table does not have any foreign or unique keys). But I'm wondering if it's taking so long because of the several indexes in the table.
Looking at the SQL file, it appears that the indexes are being created in the CREATE TABLE statement, prior to inserting all the data. Based on my (limited) research and personal experience, I've found that it's faster to add the indexes after inserting all the data. Does it not have to check the indexes for every INSERT? I know that mysqldump does have a --disable-keys option which does exactly that – disable the keys prior to inserting, but apparently this only works with MyISAM tables and not InnoDB.
But why couldn't mysqldump omit the keys from the CREATE TABLE statement for InnoDB tables, then do an ALTER TABLE after all the data is inserted? Or does InnoDB work differently, so that there is no speed difference?
Thanks!
I experimented with this concept a bit at a past job, where we needed a fast method of copying schemas between MySQL servers.
There is indeed a performance overhead when you insert to tables that have secondary indexes. Inserts need to update the clustered index (aka the table), and also update secondary indexes. The more indexes a table has, the more overhead it causes for inserts.
InnoDB has a feature called the change buffer which helps a bit by postponing index updates, but they have to get merged eventually.
Inserts to a table with no secondary indexes are faster, so it's tempting to try to defer index creation until after your data is loaded, as you describe.
Percona Server, a branch of MySQL, experimented with a mysqldump --optimize-keys option. When you use this option, it changes the output of mysqldump to have CREATE TABLE with no indexes, then INSERT all data, then ALTER TABLE to add the indexes after the data is loaded. See https://www.percona.com/doc/percona-server/LATEST/management/innodb_expanded_fast_index_creation.html
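Roughly, the transformed dump has this shape (a simplified illustration, not actual Percona output; table and index names are made up):
CREATE TABLE t (
  id INT NOT NULL,
  name VARCHAR(100),
  PRIMARY KEY (id)
) ENGINE=InnoDB;
INSERT INTO t VALUES (1,'a'),(2,'b');     -- ... all the data, in big multi-row INSERTs ...
ALTER TABLE t ADD INDEX idx_name (name);  -- secondary indexes added only after the load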
But in my experience, the net improvement in performance was small. It still takes a while to insert a lot of rows, even for tables with no indexes. Then the restore needs to run an ALTER TABLE to build the indexes. This takes a while for a large table. When you count the time of the INSERTs plus the extra time to build indexes, it's only a few (low single-digit) percent faster than inserting the traditional way, into a table with indexes.
Another benefit of this post-processing index creation is that the indexes are stored more compactly, so if you need to save disk space, that's a better reason to use this technique.
I found it much more beneficial to performance to restore by loading several tables in parallel.
The mysqlpump tool, introduced in MySQL 5.7, supports multi-threaded dumps.
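An illustrative invocation (credentials are placeholders; by default it dumps all non-system databases):
mysqlpump -u username -p --default-parallelism=4 > dump.sql
Note that the output is still a single SQL script, so the restore itself is not parallelized.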
The open-source tool mydumper supports multi-threaded dumps, and also has a multi-threaded restore tool called myloader. The worst downside of mydumper/myloader is that the documentation is virtually non-existent, so you need to be an intrepid power user to figure out how to run it.
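A sketch of how they are typically invoked (flags as I recall them; check mydumper --help and myloader --help, since they vary between versions, and all names and paths here are placeholders):
# dump with 4 threads into a directory
mydumper -u username -p password -B mydatabase -t 4 -o /backups/mydatabase-dump
# restore with 4 threads from that directory
myloader -u username -p password -B mydatabase -d /backups/mydatabase-dump -t 4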
Another strategy is to use mysqldump --tab to dump delimited data files (tab-separated by default) instead of SQL scripts. Bulk-loading these files is much faster than executing SQL scripts to restore the data. More precisely, it dumps an .sql file with the table definition and a .txt file with the data, one pair per table. You have to recreate the tables by running all the .sql files (this is quick), and then use mysqlimport to load the data files. The mysqlimport tool even has a --use-threads option for parallel execution.
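A sketch of that workflow (paths and credentials are placeholders; the server itself writes the data files, so the target directory must be writable by mysqld and allowed by secure_file_priv):
# dump: one .sql (table definition) and one .txt (tab-delimited data) file per table
mysqldump -u username -p --tab=/tmp/dumpdir mydatabase
# restore: recreate the tables from the .sql files (quick) ...
cat /tmp/dumpdir/*.sql | mysql -u username -p mydatabase
# ... then bulk-load the data files in parallel
mysqlimport -u username -p --use-threads=4 mydatabase /tmp/dumpdir/*.txt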
Test carefully with different numbers of parallel threads. My experience is that 4 threads is the best. With greater parallelism, InnoDB becomes a bottleneck. But your experience may be different, depending on the version of MySQL and your server hardware's performance capacity.
The fastest restore method of all is to use a physical backup tool; the most popular is Percona XtraBackup. This allows for fast backups and even faster restores. The backed-up files are literally ready to be copied into place and used as live tablespace files. The downside is that you must shut down your MySQL Server to perform the restore.
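A rough outline of the XtraBackup cycle (xtrabackup 2.4-style syntax; paths are placeholders):
xtrabackup --backup --target-dir=/backups/full      # take the physical backup
xtrabackup --prepare --target-dir=/backups/full     # make it consistent
# restore: stop mysqld and make sure the datadir is empty, then
xtrabackup --copy-back --target-dir=/backups/full
chown -R mysql:mysql /var/lib/mysql                 # fix ownership before restarting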

Renaming MySQL Engine in MySQL Dump File

Here is the situation I am stuck in,
Situation
We want to move from MyISAM to InnoDB Engine, so that there will be no table level locks.
Catch
We can get a maximum of 1 hour of service downtime, and not a minute more.
Our DB machine's hardware spec is very low: 8 GB RAM.
Learnings
Recently we learnt that migrating our DB engine would take 3-4 hours, including the engine conversion and re-indexing. (This was emulated with a live DB dump in an offline environment.)
This is because the engine migration re-creates the schema with InnoDB as the engine and re-inserts all table data into the new schema.
What I found
One interesting fact I found is that if, after the MySQL dump file is created, I replace the text MyISAM with InnoDB in the dump file and then import it into a new DB, the maximum time taken was 50 minutes, and all tables were converted to InnoDB along with the right indexes.
My Question
Is the approach I took correct?
Does it lead to any data corruption or index corruption?
I did it with no problems. Beware of features that are only supported by MyISAM, such as AUTO_INCREMENT on a secondary column of a multiple-column index, or FULLTEXT indexes (MyISAM-only before MySQL 5.6).
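For reference, a minimal sketch of the text replacement described above (GNU sed; the file and database names are placeholders, and it's worth keeping a copy of the original dump):
sed -i.bak 's/ENGINE=MyISAM/ENGINE=InnoDB/g' dump.sql   # .bak keeps the original file
mysql -u username -p newdatabase < dump.sql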

Will compressing a MySQL text field reduce disk space on the current database?

I have a rather large MySQL database table where one field is LONGTEXT. I wish to use compression on this field; would this result in disk space reduction? Or is the storage space already allocated, so it won't result in any space reduction? The storage engine is InnoDB.
InnoDB compressed row format has some prerequisites:
innodb_file_format=Barracuda in your my.cnf file.
innodb_file_per_table in your my.cnf file. The compression doesn't work for tables stored in the central tablespace (ibdata1).
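For example, in my.cnf (a minimal sketch; restart the server after changing these):
[mysqld]
innodb_file_format = Barracuda
innodb_file_per_table = 1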
Change a table to use compressed format.
ALTER TABLE MyTable ROW_FORMAT=COMPRESSED
Then the table should be stored in the compressed format, and it will take less space.
The ratio of compression depends on your data.
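One way to gauge the effect is to compare the sizes reported by information_schema before and after the rebuild (the figures are approximate; the schema name is a placeholder):
SELECT table_name,
       ROUND(data_length/1024/1024)  AS data_mb,
       ROUND(index_length/1024/1024) AS index_mb
FROM information_schema.TABLES
WHERE table_schema = 'mydb' AND table_name = 'MyTable';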
Note: If you had a table stored in the central ibdata1 tablespace and you restructure the table into file-per-table, ibdata1 will not shrink! The only way to shrink ibdata1 is to dump all your InnoDB data, shut down MySQL, rm the tablespace, restart MySQL, and reload your data.
Re your comments:
No, it shouldn't change the way you do queries.
Yes, it will take a long time. How long depends on your system. I'd recommend trying it with a smaller table first so you get a feel for it.
It needs to spend some CPU resources to compress when you write, and decompress when you read. It's common for a database server to have extra CPU resources -- databases are more typically constrained by I/O. But the speed might become a bottleneck. Again, it depends on your system, your data, and your usage. I encourage you to test carefully to see if it's a net win or not.

MySQL bulk inserts with LOAD DATA INFILE - MyISAM only marginally slower than MEMORY engine

We are currently performing several performance tests on MySQL to compare it to an approach we are developing for a database prototype. In short: the database is empty; given a huge CSV file, load the data into memory as fast as possible.
We are testing on a 12-core Westmere server with 48 GB RAM, so memory consumption is not a real issue right now.
The problem is the following. We have chosen MySQL (widely used, open source) for comparison. Since our prototype is an in-memory database, we have chosen the MEMORY engine in MySQL.
We insert this way (files are up to 26 GB in size):
drop table if exists a.a;
SET @@max_heap_table_size=40000000000;
create table a.a(col_1 int, col_2 int, col_3 int) ENGINE=MEMORY;
LOAD DATA CONCURRENT INFILE "/tmp/input_files/input.csv" INTO TABLE a.a FIELDS TERMINATED BY ";";
Performing this load on a 2.6 GB file takes about 80 s, which is about four times slower than a plain line count (wc -l). Using MyISAM is only 4 seconds slower, even though it is writing to disk.
What am I doing wrong here? I suppose that writing data with the MEMORY engine should be far faster than with MyISAM. And I don't understand why wc -l (both are single threaded, and writing to memory should not be that slow) is so much faster.
PS: changing read_buffer_size or any other variables I found while googling did not result in significant improvements.
Try setting the following variables as well (a sketch of applying them in the session that runs the load follows below):
max_heap_table_size=40GB;
bulk_insert_buffer_size=32MB
read_buffer_size=1M
read_rnd_buffer_size=1M
It may reduce query execution time slightly.
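For example, in the session that runs the load (the values mirror the suggestions above, and the LOAD DATA statement mirrors the one in the question, minus CONCURRENT):
SET SESSION max_heap_table_size     = 40 * 1024 * 1024 * 1024;  -- roughly 40 GB, for the MEMORY table
SET SESSION bulk_insert_buffer_size = 32 * 1024 * 1024;
SET SESSION read_buffer_size        = 1024 * 1024;
SET SESSION read_rnd_buffer_size    = 1024 * 1024;
LOAD DATA INFILE '/tmp/input_files/input.csv' INTO TABLE a.a FIELDS TERMINATED BY ';';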
Also, CONCURRENT works only with MyISAM tables, and according to the manual it slows down inserts; see Load Data Infile.
I don't think you can compare the speed of an insert, which is a write operation, with wc -l, which is a read operation, since writes are always slower than reads.
Loading 2.6 GB of data into RAM is going to take a considerable amount of time. It mostly depends on the write speed of your RAM and the I/O configuration of your OS.
Hope this helps.
I think the reason you didn't see a significant difference between the MEMORY engine and the MyISAM engine is due to disk caching. You have 48GB of RAM and are only loading 2.6GB of data.
The MyISAM engine is writing to 'files' but the OS is using its file caching features to make those file writes actually occur in RAM. Then it will 'lazily' make the actual writes to disk. Since you mentioned 'wc', I'll assume you are using Linux. Read up on the dirty_ratio and dirty_background_ratio kernel settings as a starting point to understanding how that works.
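To see where those writeback thresholds currently sit on a Linux box (purely illustrative):
sysctl vm.dirty_ratio vm.dirty_background_ratio
# or, equivalently
cat /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_background_ratio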

Best Linux filesystem for MySQL with a 100% SELECT workload

I have a MySQL database that contains millions of rows per table, and there are 9 tables in total. The database is fully populated, and all I am doing is reads, i.e., there are no INSERTs or UPDATEs. Data is stored in MyISAM tables.
Given this scenario, which Linux file system would work best? Currently I have XFS, but I read somewhere that XFS has horrible read performance. Is that true? Should I move the database to an ext3 file system?
Thanks
What about a RAM disk?
It's not about the FS, but it can improve your SELECTs. Did you evaluate MySQL table partitioning?
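A sketch of what the RAM-disk idea could look like for a read-only MyISAM data set (paths and the tmpfs size are placeholders; data on tmpfs is lost on reboot, so keep the on-disk copy as the source of truth):
mount -t tmpfs -o size=8G tmpfs /mnt/ramdisk
cp -a /var/lib/mysql/mydb /mnt/ramdisk/mydb
# point MySQL at the copy for this one database, e.g. via a symlink in the datadir, then restart mysqld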