Index Creation after LOAD DATA INFILE - MySQL

I'm using MySQL v5.6.
I'm inserting about 10 million rows into a newly created InnoDB table, and I'm trying to choose the best way to do this between LOAD DATA INFILE and multiple inserts.
LOAD DATA INFILE should be (and is) more efficient, but I'm observing something odd: the index creation takes noticeably longer (by about 15%) when the data was loaded with LOAD DATA INFILE...
Steps to observe this (each step starts only when the previous one has fully completed):
I create a new table (table_1)
I create a new table (table_2)
I insert 10 million rows into table_1 with multiple inserts (batches of 5,000)
I insert 10 million rows into table_2 with LOAD DATA INFILE
I create 4 indexes at a time (with a single ALTER TABLE, as sketched below) on table_1
I create 4 indexes at a time (with a single ALTER TABLE) on table_2 -> about 15% longer than the previous step
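A hedged sketch of such a statement, with made-up column and index names, just to show the shape of the "4 indexes at a time" ALTER TABLE:
ALTER TABLE table_1
  ADD INDEX idx_col_a (col_a),
  ADD INDEX idx_col_b (col_b),
  ADD INDEX idx_col_c (col_c),
  ADD INDEX idx_col_d (col_d);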
What could explain that?
(Of course, results are the same with steps ordered 2, 1, 4, 3, 6, 5.)

It's possible that the data load with INSERT left more of the table's data pages in the buffer pool. When creating the indexes on the table populated with LOAD DATA, MySQL first had to read pages from disk into the buffer pool, and only then could it index the data in them.
You can test this by querying after you load data:
SELECT table_name, index_name, COUNT(*)
FROM INFORMATION_SCHEMA.INNODB_BUFFER_PAGE
WHERE table_name IN ('`mydatabase`.`table_1`', '`mydatabase`.`table_2`')
GROUP BY table_name, index_name;
Then do this again after you build your indexes.
(Of course replace mydatabase with the name of the database you create these tables in.)

Related

Fastest way to replace data in a table from a temporary table in MySQL

I have a need to "update" some table data I receive from an external source (every time I receive "all" of the data, with some fields updated for some records).
There's no unique field or combination of fields, so I figured the best approach would be to wipe out all the data in the DB each time and write all the (now updated) data in again. There are up to 1,000 records (there will never be more than that), with about 15 short fields each: text, numbers, datetimes. And I'm writing to a remote DB, so it's slow.
Currently I'm doing:
delete from `table` where `date_dt` > ?
and then for each row
INSERT INTO `table` ( `field_0`,`field_1`,... ) VALUES (?,?,...)
It's not only slow, but it's possible that the end user may not see the complete data while I'm still inserting.
I figured I could do:
CREATE TEMPORARY TABLE `temp_table` ( ... ); -- same structure as in main table
INSERT INTO `temp_table` ( `field_0`,`field_1`,... ) VALUES (?,?,...) -- repeat 1000x
START TRANSACTION;
DELETE FROM `table`;
INSERT INTO `table` SELECT * FROM `temp_table`;
DROP TEMPORARY TABLE `temp_table`;
COMMIT;
Does this make any sense? What is a better way of solving this?
The speed of filling the temp table with data is not crucial, but filling the main table is (so users don't see incomplete data, or the period of time during which they do is minimal).
mysqlimport --delete will empty the table first, and then load your external data from a CSV file. It runs many times faster than doing one INSERT per row.
See https://dev.mysql.com/doc/refman/5.7/en/mysqlimport.html
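For reference, a hedged example invocation (host, credentials, path, and delimiters are assumptions; mysqlimport derives the target table name from the CSV file's base name):
mysqlimport --delete --local \
  --fields-terminated-by=',' --lines-terminated-by='\n' \
  -h remotehost -u myuser -p \
  mydatabase /path/to/table.csv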
I did a presentation in April 2017 about performance of bulk data loads for MySQL:
https://www.slideshare.net/billkarwin/load-data-fast
P.S.: Don't use the temp table solution if you have a MySQL replication environment. This is a well-known way of breaking replication. If the slave restarts in between your creation of the temp table and the INSERT...SELECT that reads from the temp table, then the slave will find the temp table is gone, and this will result in an error and stop replication. This might seem unlikely, but it does happen eventually.

Difference in datafile size for the same data?

I use MariaDB 10.1.16
This time I'm doing a very simple job:
Select data from Oracle and make a CSV file.
Load it into MariaDB using the LOAD DATA INFILE command.
The DB engine is InnoDB.
The data row count is 6,497,641.
Both tables are created with the same query.
The PK is auto_increment and of type int.
Rows created by...
TABLE1 - load infile data...
TABLE2 - insert into TABLE2 select * from TABLE1...
The sizes of the tables are below.
TABLE1 - 3.3 GB
TABLE2 - 1.9 GB
The contents of mysql.innodb_table_stats are below.
TABLE1: n_rows(5438171) , clustered_index_size(196096), sum_of_other_index_sizes(12853)
TABLE2: n_rows(6407131) , clustered_index_size(106048), sum_of_other_index_sizes(12273)
I want to know why the file sizes are different.
Thank you.
The order of the rows can make a big difference. If the data is sorted by the PRIMARY KEY as it is inserted, the blocks will be packed nearly full. If the rows arrive in random order, the end result will be blocks that are about 69% full. This is the nature of inserting into a BTree.
n_rows is just an approximation, hence the inconsistent count. The other pair of values is, I think, an exact count of 16KB blocks.
Since the PK is "clustered" with the data, the clustered_index_size is the size of the data, plus some overhead for the BTree on the PK. Plus a lot of overhead and/or wasted space (as mentioned above).
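A hedged way to confirm this on the tables above: rebuild TABLE1, which re-inserts its rows in PRIMARY KEY order, then compare clustered_index_size again (the database name 'mydb' is a placeholder):
-- For InnoDB, OPTIMIZE TABLE recreates the table and re-packs the clustered index
OPTIMIZE TABLE TABLE1;
-- Compare the persistent statistics for both tables afterwards
SELECT table_name, n_rows, clustered_index_size, sum_of_other_index_sizes
FROM mysql.innodb_table_stats
WHERE database_name = 'mydb' AND table_name IN ('TABLE1', 'TABLE2');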

How to load a huge data set into a newly created table?

I'm trying to add a FULLTEXT index to my table. That table contains 3 million records. It was very difficult to add that index using an ALTER TABLE statement or a CREATE INDEX statement. Therefore, the easiest way seems to be to create a new table, add the index first, and then load the data. How can I load the existing table's data into the newly created table? I'm using the XAMPP MySQL database.
I don't know why creating a full text index on an existing table would be difficult. You just do:
create fulltext index idx_table_col on table(col)
Usually, it is faster to add indexes to already loaded tables than to load data into an empty table that has indexes pre-defined.
EDIT:
You can do the load using INSERT. The following will insert the first 100,000 rows:
insert into newtable
select *
from oldtable
order by id
limit 0, 100000;
You can put this in a loop (via a stored procedure in MySQL or at the application level). Perhaps this will return faster. Each time you run it, you would change the offset value in the LIMIT clause.
I would expect that the overall time for creating an index would be less than using insert, but for your purposes, you might find this more convenient.
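A minimal sketch of such a loop as a MySQL stored procedure (hedged: it assumes the newtable/oldtable names above and an increasing integer id column; the chunk size is arbitrary):
DELIMITER //
CREATE PROCEDURE copy_in_chunks()
BEGIN
  DECLARE v_offset INT DEFAULT 0;
  DECLARE v_copied INT DEFAULT 1;
  WHILE v_copied > 0 DO
    -- Copy the next chunk of 100,000 rows in id order
    INSERT INTO newtable
    SELECT * FROM oldtable
    ORDER BY id
    LIMIT v_offset, 100000;
    -- ROW_COUNT() is 0 once the offset runs past the end of oldtable
    SET v_copied = ROW_COUNT();
    SET v_offset = v_offset + 100000;
  END WHILE;
END //
DELIMITER ;
CALL copy_in_chunks();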
INSERT INTO newTable SELECT * FROM oldTable;
Do this after your new table and the index on it have been created.
This assumes you want to copy all columns. You can select specific columns as well.

LOAD DATA INFILE with a SELECT statement

I have the following database relationship:
I also have this large CSV file that I want to insert into bmt_transcripts:
Ensembl Gene ID Ensembl Transcript ID
ENSG00000261657 ENST00000566782
ENSG00000261657 ENST00000562780
ENSG00000261657 ENST00000569579
ENSG00000261657 ENST00000568242
The problem is that I can't insert the Ensembl Gene ID as a string; I need to look up its ID in the bmt_genes table, so I came up with this code:
LOAD DATA INFILE 'filename.csv'
INTO TABLE `bmt_transcripts`
(@gene_ensembl, ensembl_id)
SET gene_id = (SELECT id FROM bmt_genes WHERE ensembl_id = @gene_ensembl);
However, this takes over 30 minutes to load a 7 MB CSV, which is far too long. I assume it's running a table-wide query for every row it inserts, which is obviously horribly inefficient. I know I could load the data into a temporary table and SELECT from that (which, yes, runs in some 5 seconds), but this CSV may grow to have some 20 columns, which will become unwieldy to write a SELECT statement for.
How can I fix my LOAD DATA INFILE query (which runs a SELECT on another table) to run in a reasonable length of time?
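For reference, a hedged sketch of the temporary-table approach mentioned above (column types and the staging table name are assumptions; only bmt_transcripts and bmt_genes come from the question):
-- Stage the raw CSV rows first
CREATE TEMPORARY TABLE tmp_transcripts (
  gene_ensembl VARCHAR(32),
  ensembl_id   VARCHAR(32)
);
LOAD DATA INFILE 'filename.csv'
INTO TABLE tmp_transcripts
(gene_ensembl, ensembl_id);
-- Resolve gene_id with one set-based JOIN instead of a subquery per row
INSERT INTO bmt_transcripts (gene_id, ensembl_id)
SELECT g.id, t.ensembl_id
FROM tmp_transcripts t
JOIN bmt_genes g ON g.ensembl_id = t.gene_ensembl;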

LOAD DATA reclaim disk space after delete

I have a DB schema composed of MyISAM tables, and I am interested in deleting old records from some of the tables from time to time.
I know that DELETE does not reclaim the disk space, but, as I found in the description of the DELETE command, inserts may reuse the deleted space:
In MyISAM tables, deleted rows are maintained in a linked list and subsequent INSERT operations reuse old row positions.
I am interested in whether the LOAD DATA command also reuses the deleted space.
UPDATE
I am also interested in how the index space is reclaimed.
UPDATE 2012-12-03 23:11
Some more info, supplied based on the answer received from @RolandoMySQLDBA.
After executing the suggested query below, I got different results for the different tables whose space needs to be reused or reclaimed:
SELECT row_format FROM information_schema.tables
WHERE table_schema='mydb' AND table_name='mytable1';
> Dynamic
SELECT row_format FROM information_schema.tables
WHERE table_schema='mydb' AND table_name='mytable2';
> Fixed
UPDATE 2012-12-09 08:06
LOAD DATA does reuse previously deleted space (I have checked this by running a short script), but only if the row format is Fixed, or if the row format is Dynamic and there is a deleted row of exactly the same size.
It seems that if the row_format is Dynamic, a full lookup over the deleted-row list is made for each record; if a row of the exact size is not found, the deleted space is not reused, the table's disk usage grows, and LOAD DATA takes much longer to import the records.
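For reference, a sketch of one such check (hedged; mydb and mytable1 are the names used above): Data_free reports the bytes held in MyISAM's deleted-row list, so comparing it before and after the load shows whether the space was reused.
SELECT data_free
FROM information_schema.tables
WHERE table_schema='mydb' AND table_name='mytable1';
-- Run this after the DELETE, then again after LOAD DATA INFILE;
-- if data_free shrinks, the deleted space was reused.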
I will accept the answer given here, since it describes the whole process perfectly.
For a MySQL table called mydb.mytable just run the following:
OPTIMIZE TABLE mydb.mytable;
You could also do this in stages:
CREATE TABLE mydb.mytable_new LIKE mydb.mytable;
ALTER TABLE mydb.mytable_new DISABLE KEYS;
INSERT INTO mydb.mytable_new SELECT * FROM mydb.mytable;
ALTER TABLE mydb.mytable_new ENABLE KEYS;
ALTER TABLE mydb.mytable RENAME mydb.mytable_old;
ALTER TABLE mydb.mytable_new RENAME mydb.mytable;
DROP TABLE mydb.mytable_old;
ANALYZE TABLE mydb.mytable;
In either case, the table ends up with no fragmentation.
Give it a Try !!!
UPDATE 2012-12-03 12:50 EDT
If you are concerned whether or not rows are reused upon bulk INSERTs via LOAD DATA INFILE, please note the following:
When you created the MyISAM table, I assumed the default row format would be dynamic. You can check what it is with either
SHOW CREATE TABLE mydb.mytable\G
or
SELECT row_format FROM information_schema.tables
WHERE table_schema='mydb' AND table_name='mytable';
Since the row format of your table is Dynamic, the fragmented rows are of various sizes. The MyISAM storage engine would have to keep checking the length of each deleted row to see whether the next set of data being inserted will fit. If the incoming data cannot fit in any of the deleted rows, the new row data is appended.
The presence of such rows can make myisamchk struggle.
This is why I recommended running OPTIMIZE TABLE. That way, data would be appended quicker.
UPDATE 2012-12-03 12:58 EDT
Here is something interesting you can also do: Try setting concurrent_insert to 2. That way, you are always appending to a MyISAM table without checking for gaps in the table. This will speed up INSERTs dramatically but leave all known gaps alone.
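A hedged example of that setting (it can also be set in my.cnf under [mysqld]):
-- 2 (ALWAYS): new rows are always appended at the end of the data file,
-- even when there are holes left by deleted rows
SET GLOBAL concurrent_insert = 2;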
You could still defragment your table at your earliest convenience using OPTIMIZE TABLE.
UPDATE 2012-12-03 13:40 EDT
Why not run my second suggestion?
CREATE TABLE mydb.mytable_new LIKE mydb.mytable;
ALTER TABLE mydb.mytable_new DISABLE KEYS;
INSERT INTO mydb.mytable_new SELECT * FROM mydb.mytable;
ALTER TABLE mydb.mytable_new ENABLE KEYS;
ALTER TABLE mydb.mytable RENAME mydb.mytable_old;
ALTER TABLE mydb.mytable_new RENAME mydb.mytable;
ANALYZE TABLE mydb.mytable;
This will give you an idea of:
How long OPTIMIZE TABLE would take to run
How much smaller the .MYD and .MYI would be after running OPTIMIZE TABLE
After you run my second suggestion, you can compare them with
SELECT
A.mydsize,B.mydsize,A.mydsize - B.mydsize myd_diff,
A.myisize,B.myisize,A.myisize - B.myisize myi_diff
FROM
(
SELECT data_length mydsize,index_length myisize
FROM information_schema.tables
WHERE table_schema='mydb' AND table_name='mytable'
) A,
(
SELECT data_length mydsize,index_length myisize
FROM information_schema.tables
WHERE table_schema='mydb' AND table_name='mytable_new'
) B;
UPDATE 2012-12-03 16:42 EDT
Any table whose ROW_FORMAT is set to Fixed has the luxury of allocating the same-length row every time. If MyISAM tables maintain a list of deleted rows, the very first row in the list should always be selected as the next position to insert data into. There would be no need to traverse the whole list until a row gap of sufficient length is found. Each deleted row is quickly appended to the list after a DELETE, and each INSERT would simply pick the first row off the deleted list.
We can assume these things because MyISAM tables can do concurrent inserts. In order for this feature to be available via the concurrent_insert option, INSERTs into a MyISAM table must be able to detect one of three (3) things:
The presence of a list of deleted rows, thus choosing from the list
  Row_Format=Dynamic : a list of deleted rows, each row with a different length
  Row_Format=Fixed : a list of deleted rows, all rows the same length
The absence of a list of deleted rows, thus appending
Bypassing the check for the presence of a list of deleted rows (set concurrent_insert to 2)
For detection #1 to be the fastest possible, a MyISAM table's row_format must be Fixed. If it is Dynamic, it is very possible that a list traversal is necessary.