LOAD DATA INFILE with a SELECT statement - mysql

I have the following database relationship:
I also have this large CSV file that I want to insert into bmt_transcripts:
Ensembl Gene ID Ensembl Transcript ID
ENSG00000261657 ENST00000566782
ENSG00000261657 ENST00000562780
ENSG00000261657 ENST00000569579
ENSG00000261657 ENST00000568242
The problem is that I can't insert the Ensembl Gene ID as a string; I need to find its id in the bmt_genes table first, so I came up with this code:
LOAD DATA INFILE 'filename.csv'
INTO TABLE `bmt_transcripts`
(@gene_ensembl, ensembl_id)
SET gene_id = (SELECT id FROM bmt_genes WHERE ensembl_id = @gene_ensembl);
However, this takes over 30 minutes to load a 7 MB CSV, which is far too long. I assume it's running a table-wide query for every row it inserts, which is obviously horribly inefficient. I know I could load the data into a temporary table and SELECT from that (which, yes, runs in some 5 seconds), but this CSV may grow to have some 20 columns, which would become unwieldy to write a SELECT statement for.
How can I fix my LOAD DATA INFILE query (which runs a SELECT on another table) to run in a reasonable length of time?
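For reference, a minimal sketch of the staging-table route mentioned above, assuming a tab-delimited file and short VARCHAR identifiers (the staging table name, column types, and delimiter are illustrative, not from the original post):
CREATE TEMPORARY TABLE transcripts_staging (
  gene_ensembl VARCHAR(32) NOT NULL,
  ensembl_id   VARCHAR(32) NOT NULL
);
LOAD DATA INFILE 'filename.csv'
INTO TABLE transcripts_staging
FIELDS TERMINATED BY '\t'  # adjust to match the file's actual delimiter
(gene_ensembl, ensembl_id);
# One set-based join instead of a correlated subquery per inserted row;
# an index on bmt_genes.ensembl_id keeps the lookup fast.
INSERT INTO bmt_transcripts (gene_id, ensembl_id)
SELECT g.id, s.ensembl_id
FROM transcripts_staging s
JOIN bmt_genes g ON g.ensembl_id = s.gene_ensembl;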

Related

Replacing a 1.5 million row data set every 5 minutes

I have a MySQL database with a table that is populated with approximately 1.5 million rows of data that needs to be entirely refreshed every 5 minutes. The data is no longer needed once it is older than 5 minutes.
Getting the data into the table is no problem...I can populate it in approximately 50-70 seconds. Where I'm having some trouble is figuring out how to shift all the old data out and replace it with new data. I need to be able to run queries at any time across the entire data set. These queries need to run very fast and they must contain only data from one data set at a time (i.e., the query should not pull a combination of new and old data during the 1 minute that the table is being updated).
I do not have much experience working with large temporary data sets, so I would appreciate some advice on how best to solve this problem.
Create partitions. You can then populate one partition while users query from the other.
To do this manually you just need something like...
CREATE TABLE tbl0 (blah)
CREATE TABLE tbl1 (blah)
CREATE TABLE meta (combined_source INT)
INSERT INTO meta VALUES (0)
CREATE VIEW combined AS
SELECT * FROM tbl0 WHERE 0 = (SELECT combined_source FROM meta) % 2
UNION ALL
SELECT * FROM tbl1 WHERE 1 = (SELECT combined_source FROM meta) % 2
Now you can insert new data into the 'inactive' table and it WON'T appear in the view.
Next, increment the value in meta. Immediately the view switches from showing data from one table to showing data from the other table.
On your next iteration you just check meta to determine which table to empty and load the new data into.
One benefit of this approach is that you don't even need to be within a transaction.
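Put concretely, one refresh cycle might look like this sketch (assuming combined_source is currently even, so the view reads from tbl0 and tbl1 is the inactive table):
# Hypothetical refresh cycle: tbl1 is currently hidden from the view.
TRUNCATE TABLE tbl1;  # drop the stale data set
# ... bulk-load the fresh rows into tbl1 here (LOAD DATA INFILE, batched INSERTs, etc.) ...
UPDATE meta SET combined_source = combined_source + 1;  # the view instantly switches to tbl1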

Fastest way to replace data in a table from a temporary table in MySQL

I need to "update" some table data I receive from an external source (every time I receive "all" of the data, with some fields updated for some records).
There's no unique field or combination of fields, so I figured the best way would be to wipe out all the data from the DB every time and write all the (now updated) data back in. There are up to 1000 records (there will never be more than that), with about 15 short fields each: text, numbers, datetime. And I'm writing it to a remote DB (so it's slow).
Currently I'm doing:
delete from `table` where `date_dt` > ?
and then for each row
INSERT INTO `table` ( `field_0`,`field_1`,... ) VALUES (?,?,...)
It's not only slow, but it's possible that the end user may not see the complete data while I'm still inserting.
I figured I could do:
CREATE TEMPORARY TABLE `temp_table` ( ... ); -- same structure as in main table
INSERT INTO `temp_table` ( `field_0`,`field_1`,... ) VALUES (?,?,...) -- repeat 1000x
START TRANSACTION;
DELETE FROM `table`;
INSERT INTO `table` SELECT * FROM `temp_table`;
DROP TEMPORARY TABLE `temp_table`;
COMMIT;
Does this make any sense? What is a better way of solving this?
The speed of filling up the temp table with data is not crucial, but filling the main table with data is (so users don't see incomplete data, or the period of time they do is minimal).
mysqlimport --delete will truncate the table first, and then load your external data from a CSV file. It runs many times faster than doing INSERT one row at a time.
See https://dev.mysql.com/doc/refman/5.7/en/mysqlimport.html
I did a presentation in April 2017 about performance of bulk data loads for MySQL:
https://www.slideshare.net/billkarwin/load-data-fast
P.S.: Don't use the temp table solution if you have a MySQL replication environment. This is a well-known way of breaking replication. If the slave restarts in between your creation of the temp table and the INSERT...SELECT that reads from the temp table, then the slave will find the temp table is gone, and this will result in an error and stop replication. This might seem unlikely, but it does happen eventually.
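For illustration only, the pattern that mysqlimport --delete performs can be sketched in plain SQL roughly as follows (the file path, table name, and delimiter are placeholders; mysqlimport itself derives the table name from the file name):
# Rough sketch of the "empty, then bulk-load" pattern; no temporary table is
# involved, so the replication caveat above does not apply.
DELETE FROM `table`;  # what --delete does conceptually
LOAD DATA LOCAL INFILE '/path/to/table.csv'
INTO TABLE `table`
FIELDS TERMINATED BY ',';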

Index Creation after Load Data InFile

I'm using MySQL v5.6.
I'm inserting about 10 million rows into a newly created table (InnoDB). I'm trying to choose the best way to do this between LOAD DATA INFILE and multiple inserts.
LOAD DATA INFILE should be (and is) more efficient, but I'm observing a weird thing: the index creation takes much longer (by 15%) when using LOAD DATA INFILE...
Steps to observe that (each step starts when the previous one is fully done):
I create a new table (table_1)
I create a new table (table_2)
I insert 10 million rows into table_1 with multiple inserts (batches of 5000)
I insert 10 million rows into table_2 with LOAD DATA INFILE
I create 4 indexes at a time (with ALTER TABLE) on table_1
I create 4 indexes at a time (with ALTER TABLE) on table_2 -> about 15% longer than the previous step
What could explain that?
(Of course, results are the same with steps ordered 2, 1, 4, 3, 6, 5.)
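For reference, "creating 4 indexes at a time with ALTER TABLE" (steps 5 and 6) refers to a single statement along these lines; the column and index names are purely illustrative:
# Hypothetical example: all four secondary indexes added in one ALTER TABLE statement.
ALTER TABLE table_1
  ADD INDEX idx_col_a (col_a),
  ADD INDEX idx_col_b (col_b),
  ADD INDEX idx_col_c (col_c),
  ADD INDEX idx_col_d (col_d);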
It's possible that the data load with INSERT resulted in more data pages left occupying the buffer pool. When creating the indexes on the table that used LOAD DATA, MySQL first had to read pages from disk into the buffer pool, and only then index the data in them.
You can test this by querying after you load data:
SELECT table_name, index_name, COUNT(*)
FROM INFORMATION_SCHEMA.INNODB_BUFFER_PAGE
WHERE table_name IN ('`mydatabase`.`table_1`', '`mydatabase`.`table_2`')
GROUP BY table_name, index_name;
Then do this again after you build your indexes.
(Of course replace mydatabase with the name of the database you create these tables in.)

How to get row ids when using LOAD LOCAL DATA INFILE?

I have a MySQL database with a table into which I insert from multiple files using the
LOAD DATA LOCAL INFILE ... statement. I have the PRIMARY KEY id set to AUTO_INCREMENT. The problem is when I want to update only part of the table.
Say I've inserted file_1, file_2, and file_3 in the past, and now I want to update only file_2. I imagine the process as this pseudo-workflow:
delete old data related to file_2
insert new data from file_2
However, it is hard to determine which data originally came from file_2. In order to find out, I've come up with this idea:
When I insert the data, I will note the ids of the rows I've inserted; since I am using AUTO_INCREMENT, I can note something like from_id, to_id for each of the files. Then, when I want to update only file_x, I will delete only the data with from_id <= id <= to_id (where from_id, to_id relate to file_x).
After a little bit of searching, I found out about @@identity and last_insert_id(); however, when I use SELECT last_insert_id() after LOAD DATA LOCAL INFILE, I get only one id, and not the maximal id corresponding to the data, but the last one added (as it is defined). I am connecting to the database from Python using mysql.connector:
cur.execute("select last_insert_id();")
print(cur.fetchall())
# gives
# [(<some_number>,)]
So, is there a way to retrieve all (or at least the minimal and maximal) ids that were assigned to the data imported using the LOAD DATA LOCAL INFILE ... statement, as mentioned above?
If you need to remember the source of each record in the table, then you had better store that information in a field.
I would add a new field (src) of type TINYINT to the table and store the ID of the source in it (1 for file_1, 2 for file_2, and so on). I assume there won't be more than 255 sources; otherwise use SMALLINT for its type.
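A minimal sketch of that change, assuming the table is named table_1 as in the examples below (TINYINT UNSIGNED covers source IDs 1 through 255):
# Hypothetical: add the source column to the existing table; rows loaded
# before the change keep the default value of 0.
ALTER TABLE table_1 ADD COLUMN src TINYINT UNSIGNED NOT NULL DEFAULT 0;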
Then, when you need to update the records imported from file_2 you have two options:
delete all the records having src = 2 then load the new records from file into the table; this is not quite an update, it is a replacement;
load the new records from file into a new table then copy from it the values you need to update the existing records.
Option #1
Deletion is an easy job:
DELETE FROM table_1 WHERE src = 2
Loading the new data and setting the value of src to 2 is also easy (it is explained in the documentation):
LOAD DATA INFILE 'file.txt'
INTO TABLE table_1
(column1, column2, column42) # Put all the column names here
# in the same order the values appear in the file
SET src = 2 # Set values for other columns too
If there are columns in the file that you don't need then load their values into variables and simply ignore the variables. For example, if the third column from the file doesn't contain useful information you can use:
INTO TABLE table_1 (column1, column2, @unused, column42, ...)
A single variable (I called it @unused but it can have any name) can be used to load data from all the columns you want to ignore.
Option #2
The second option requires the creation of a working table but it's more flexible. It allows updating only some of the rows, based on usual WHERE conditions. However, it can be used only if the records can be identified using the values loaded from the file (with or without the src column).
The working table (let's name it table_w) has the columns you want to load from the file and is created in advance.
When it's the time to update the rows imported from file_2 you do something like this:
truncate the working table (just to be sure it doesn't contain any leftovers from a previous import);
load the data from file into the working table;
join the working table and table_1 and update the records of table_1 as needed;
truncate the working table (cleanup of the current import).
The code:
# 1
TRUNCATE table_w;
# 2
LOAD DATA INFILE 'file.txt'
INTO TABLE table_w
(column_1, column_2, column_42); # etc
# 3
UPDATE table_1 t
INNER JOIN table_w w
ON t.column_1 = w.column_1
# AND t.src = 2 # only if column_1 is not enough
SET t.column_2 = w.column_2,
t.column_42 = w.column_42
# WHERE ... you can add extra conditions here, if needed
# 4
TRUNCATE TABLE table_w

MySQL read parameter from file for select statement

I have a select query as follows:
select * from abc where id IN ($list);
The problem is the length of the variable $list: it may be 4000-5000 characters long, as a result of which the length of the actually executed query increases and it gets pretty slow.
Is there a method to store the values of $list in a file and ask MySQL to read it from that file, similar to LOAD DATA INFILE 'file_name' for insertion into a table?
Yes, you can (TM)!
First step: Use CREATE TEMPORARY TABLE temp (id ... PRIMARY KEY) and LOAD DATA INFILE ... to create and fill a temporary table holding your value list
Second step: Run SELECT abc.id FROM abc INNER JOIN temp ON abc.id=temp.id
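Put together, a minimal sketch of the two steps (the file path, the one-id-per-line format, and the INT key type are assumptions):
# Hypothetical: the file contains one id per line.
CREATE TEMPORARY TABLE temp (id INT NOT NULL PRIMARY KEY);
LOAD DATA INFILE '/path/to/id_list.txt'
INTO TABLE temp
LINES TERMINATED BY '\n'
(id);
# Join against the list instead of building a huge IN (...) clause.
SELECT abc.*
FROM abc
INNER JOIN temp ON abc.id = temp.id;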
I have the strong impression this only checks out as a win if you use the same value list quite a few times.