How to get row ids when using LOAD DATA LOCAL INFILE?

I have a MySQL database with a table into which I insert data from multiple files using the
LOAD DATA LOCAL INFILE ... statement. The PRIMARY KEY column ID is set to auto_increment. The problem arises when I want to update only part of the table.
Say I've inserted file_1, file_2 and file_3 in the past and now I want to update only file_2. I imagine the process as this pseudo workflow:
delete old data related to file_2
insert new data from file_2
However, it is hard to determine which rows originally came from file_2. To find out, I've come up with this idea:
When I insert the data, I will note the ids of the inserted rows. Since I am using auto_increment, I can record something like from_id, to_id for each file. Then, when I want to update only file_x, I will delete only the rows with from_id <= id <= to_id (where from_id and to_id relate to file_x).
After a bit of searching, I found out about @@identity and last_insert_id(). However, when I run select last_insert_id() after LOAD DATA LOCAL INFILE I get only one id, and it is not the maximal id of the imported data but the last added one (as it is defined). I am connecting to the database from Python with mysql.connector, using
cur.execute("select last_insert_id();")
print(cur.fetchall())
# gives
# [(<some_number>,)]
So, is there a way to retrieve all (or at least the minimal and maximal) ids that were assigned to the data imported using the LOAD DATA LOCAL INFILE ... statement mentioned above?

If you need to remember the source of each record in the table, then you had better store that information in a field.
I would add a new field (src) of type TINYINT UNSIGNED to the table and store the ID of the source (1 for file_1, 2 for file_2 and so on). I assume there won't be more than 255 sources; otherwise use SMALLINT for its type.
Then, when you need to update the records imported from file_2 you have two options:
delete all the records having src = 2 then load the new records from file into the table; this is not quite an update, it is a replacement;
load the new records from file into a new table then copy from it the values you need to update the existing records.
Option #1
Deletion is an easy job:
DELETE FROM table_1 WHERE src = 2
Loading the new data and setting the value of src to 2 is also easy (it is explained in the documentation):
LOAD DATA INFILE 'file.txt'
INTO TABLE table_1
(column1, column2, column42) # Put all the columns names here
# in the same order the values appear in the file
SET src = 2 # Set values for other columns too
If there are columns in the file that you don't need, load their values into a user variable and simply ignore it. For example, if the third column in the file doesn't contain useful information, you can use:
INTO TABLE table_1 (column1, column2, @unused, column42, ...)
A single variable (I called it @unused but it can have any name) can be used to load data from all the columns you want to ignore.
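Since the question connects from Python, here is a minimal sketch of how the Option #1 statement could be assembled per file. The helper name build_load_statement and the convention of passing None for an ignored column are assumptions made for illustration; the table and column names come from the example above. Identifiers and the path are interpolated as-is, so this is only suitable for trusted input.

```python
def build_load_statement(path, table, file_columns, src_id):
    """Build a LOAD DATA statement that tags every loaded row with src_id.

    file_columns lists the columns in the order they appear in the file;
    None marks a file column that is read into @unused and discarded.
    """
    targets = ", ".join("@unused" if c is None else c for c in file_columns)
    return (
        f"LOAD DATA LOCAL INFILE '{path}' "
        f"INTO TABLE {table} ({targets}) "
        f"SET src = {int(src_id)}"
    )


# Hypothetical example: the third column of file_2.txt is ignored.
statement = build_load_statement(
    "file_2.txt", "table_1", ["column1", "column2", None, "column42"], 2
)
print(statement)
```

The resulting string can be passed to cur.execute() right after the DELETE for the same src value.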
Option #2
The second option requires the creation of a working table but it's more flexible. It allows updating only some of the rows, based on usual WHERE conditions. However, it can be used only if the records can be identified using the values loaded from the file (with or without the src column).
The working table (let's name it table_w) has the columns you want to load from the file and is created in advance.
When it's the time to update the rows imported from file_2 you do something like this:
truncate the working table (just to be sure it doesn't contain any leftovers from a previous import);
load the data from file into the working table;
join the working table and table_1 and update the records of table_1 as needed;
truncate the working table (cleanup of the current import).
The code:
# 1
TRUNCATE table_w;
# 2
LOAD DATA INFILE 'file.txt'
INTO TABLE table_w
(column_1, column_2, column_42); # etc
# 3
UPDATE table_1 t
INNER JOIN table_w w
ON t.column_1 = w.column_1
# AND t.src = 2 # only if column_1 is not enough
SET t.column_2 = w.column_2,
t.column_42 = w.column_42
# WHERE ... you can add extra conditions here, if needed
;
# 4
TRUNCATE TABLE table_w;
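The four steps can also be driven from Python. The sketch below assumes a DB-API cursor such as the one mysql.connector provides; refresh_from_file and RecordingCursor are hypothetical names, and RecordingCursor is only a stand-in that records statements so the flow can be tried without a server.

```python
class RecordingCursor:
    """Stand-in for a DB-API cursor: it only records what would be executed."""

    def __init__(self):
        self.executed = []

    def execute(self, sql):
        self.executed.append(sql)


def refresh_from_file(cursor, path):
    """Run the four steps of Option #2 in order on the given cursor."""
    statements = [
        # 1: clear leftovers from a previous import
        "TRUNCATE TABLE table_w",
        # 2: load the file into the working table
        f"LOAD DATA LOCAL INFILE '{path}' "
        "INTO TABLE table_w (column_1, column_2, column_42)",
        # 3: copy the new values into the matching rows of table_1
        "UPDATE table_1 t INNER JOIN table_w w ON t.column_1 = w.column_1 "
        "SET t.column_2 = w.column_2, t.column_42 = w.column_42",
        # 4: clean up the current import
        "TRUNCATE TABLE table_w",
    ]
    for sql in statements:
        cursor.execute(sql)
    return statements


cursor = RecordingCursor()
refresh_from_file(cursor, "file_2.txt")
for sql in cursor.executed:
    print(sql)
```

With a real connection, the caller would commit after the last statement.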

Related

Dealing with large overlapping sets of data - updating just the delta

My Python application generates a CSV file containing a few hundred unique records, one unique row per line. It runs hourly and very often the data remains the same from one run to another. If there are changes, then they are small, e.g.
one record removed.
a few new records added.
occasional update to an existing record.
Each record is just four simple fields (name, date, id, description), and there will be no more than 10,000 records by the time the project is at maximum, so it can all be contained in a single table.
What is the best way to merge changes into the table?
A few approaches I'm considering are:
1) empty the table and re-populate on each run.
2) write the latest data to a staging table and run a DB job to merge the changes into the main table.
3) read the existing table data into my python script, collect the new data, find the differences, run multiple 'CRUD' operations to apply the changes one by one
Can anyone suggest a better way?
thanks
I would do this in the following way:
Load the new CSV file into a second table.
DELETE rows in the main table that are missing from the second table:
DELETE m FROM main_table AS m
LEFT OUTER JOIN new_table AS t ON m.id = t.id
WHERE t.id IS NULL;
Use INSERT ON DUPLICATE KEY UPDATE to update rows that need to be updated. This becomes a no-op on each row that already contains the same values.
INSERT INTO main_table (id, name, date, description)
SELECT id, name, date, description FROM new_table
ON DUPLICATE KEY UPDATE
name = VALUES(name), date = VALUES(date), description = VALUES(description);
Drop the second table once you're done with it.
This is assuming id is the primary key and you have no other UNIQUE KEY in the table.
Given a data set size of 10,000 rows, this should be quick enough to do it in one batch. Once the data set gets 10x larger, you may have to reconsider the solution, for example do batches of 10,000 rows at a time.
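For comparison, approach 3 from the question (computing the delta in the Python script) can be sketched as below. This is not the SQL method above but the same delete/insert/update split done client-side; the function name diff_records and the dictionary layout keyed by id are assumptions for illustration.

```python
def diff_records(old, new):
    """Compare two {id: (name, date, description)} mappings.

    Returns (to_delete, to_insert, to_update):
    ids missing from `new`, rows present only in `new`,
    and rows whose values differ between the two.
    """
    to_delete = sorted(old.keys() - new.keys())
    to_insert = {k: new[k] for k in new.keys() - old.keys()}
    to_update = {
        k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]
    }
    return to_delete, to_insert, to_update


# Hypothetical data: record 2 removed, record 3 changed, record 4 added.
old_rows = {
    1: ("alpha", "2020-01-01", "first"),
    2: ("beta", "2020-01-02", "second"),
    3: ("gamma", "2020-01-03", "third"),
}
new_rows = {
    1: ("alpha", "2020-01-01", "first"),
    3: ("gamma", "2020-01-03", "third, edited"),
    4: ("delta", "2020-01-04", "fourth"),
}
to_delete, to_insert, to_update = diff_records(old_rows, new_rows)
print(to_delete, to_insert, to_update)
```

When a run produces no changes, all three results are empty and no statements need to be sent at all.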

How to do something like SELECT "all columns except .."?

I have a MySQL table with data in it, called current. I import new data into a table called temp. Both these tables have auto_increment ID columns.
The table structure is not known in advance for the data import (there are various file structures that I need to import), even though the structure of current and temp will be the same.
Because of the unknown column configuration of the import files (tables created on the fly for each different file configuration), I cannot select specific columns, hence I would have to select all columns, less the ID column from table temp and import the result into table current.
I need to import into temp first, as the files can be large, and I need to do processing on the data before saving into the database, so I do not want to do any operations on the current table before I have imported the separate file first.
The ID column from the temp table prevents the insert into the current table due to duplicate key.
So I need something like this:
INSERT INTO `current`
(SELECT **ALL COLUMNS EXCEPT ID** FROM `temp`)
Any ideas on how to write the section ALL COLUMNS EXCEPT ID? Is this even possible?
There's no * except foo. You'll have to list all of the columns, except the ones you don't want.
SELECT field1, field2, ..., fieldN ...
You could do it via dynamic scripting, e.g. query information_schema for the field names, build up the field list as a string, prepare that string as query, execute it, etc...
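A minimal sketch of that dynamic approach, done client-side in Python rather than with a prepared statement on the server. The helper name build_copy_statement is hypothetical; COLUMN_QUERY uses the standard information_schema.columns view and the %s placeholder style of mysql.connector. The column names passed in the example are made up.

```python
# Query to fetch the column names of a table in the current database.
# Run it with cursor.execute(COLUMN_QUERY, ("temp",)) on a live connection.
COLUMN_QUERY = (
    "SELECT column_name FROM information_schema.columns "
    "WHERE table_schema = DATABASE() AND table_name = %s "
    "ORDER BY ordinal_position"
)


def build_copy_statement(columns, source="temp", target="current", skip=("id",)):
    """Build INSERT ... SELECT listing every column except those in `skip`."""
    kept = [c for c in columns if c.lower() not in skip]
    if not kept:
        raise ValueError("no columns left to copy")
    column_list = ", ".join(f"`{c}`" for c in kept)
    return (
        f"INSERT INTO `{target}` ({column_list}) "
        f"SELECT {column_list} FROM `{source}`"
    )


# Hypothetical column names, as they might come back from COLUMN_QUERY.
copy_sql = build_copy_statement(["ID", "name", "price"])
print(copy_sql)
```

Because the ID column is left out of the insert, the current table assigns fresh auto_increment values to the copied rows.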

LOAD DATA INFILE with a SELECT statement

I have the following database relationship:
I also have this large CSV file that I want to insert into bmt_transcripts:
Ensembl Gene ID Ensembl Transcript ID
ENSG00000261657 ENST00000566782
ENSG00000261657 ENST00000562780
ENSG00000261657 ENST00000569579
ENSG00000261657 ENST00000568242
The problem is that I can't insert the Ensembl Gene ID as a string; I need to find its ID from the bmt_genes table, so I came up with this code:
LOAD DATA INFILE 'filename.csv'
INTO TABLE `bmt_transcripts`
(@gene_ensembl, ensembl_id)
SET gene_id = (SELECT id FROM bmt_genes WHERE ensembl_id = @gene_ensembl);
However, this takes over 30 minutes to load a 7 MB CSV, which is far too long. I assume it's running a table-wide query for every row it inserts, which is obviously horribly inefficient. I know I could load the data into a temporary table and SELECT from that (which, yes, runs in some 5 seconds), but this CSV may grow to have some 20 columns, which will become unwieldy to write a select statement for.
How can I fix my LOAD DATA INFILE query (which runs a SELECT on another table) to run in a reasonable length of time?

MySQL read parameter from file for select statement

I have a select query as follows:
select * from abc where id IN ($list);
The problem is the length of the variable $list: it may be 4000-5000 characters long, which makes the executed query longer and pretty slow.
Is there a method to store the values of $list in a file and ask MySQL to read it from that file, similar to LOAD DATA INFILE 'file_name' for insertion into table?
Yes, you can (TM)!
First step: Use CREATE TEMPORARY TABLE temp (id ... PRIMARY KEY) and LOAD DATA INFILE ... to create and fill a temporary table holding your value list
Second step: Run SELECT abc.id FROM abc INNER JOIN temp ON abc.id=temp.id
I have the strong impression this only checks out as a win, if you use the same value list quite a few times.
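The two steps could be prepared from a script roughly as follows. This is a sketch under assumptions: write_id_file and lookup_statements are hypothetical helpers, the ids are assumed to be integers, and the INT column type is a guess because the answer leaves the type open.

```python
import os
import tempfile


def write_id_file(ids, path):
    """Write one id per line so LOAD DATA can read the file; returns the count."""
    ids = [int(i) for i in ids]
    with open(path, "w", newline="\n") as handle:
        for value in ids:
            handle.write(f"{value}\n")
    return len(ids)


def lookup_statements(path):
    """Statements for both steps. INT is an assumed type for the id column."""
    return [
        "CREATE TEMPORARY TABLE temp (id INT PRIMARY KEY)",
        f"LOAD DATA LOCAL INFILE '{path}' INTO TABLE temp (id)",
        "SELECT abc.id FROM abc INNER JOIN temp ON abc.id = temp.id",
    ]


# Hypothetical usage with a throwaway directory.
with tempfile.TemporaryDirectory() as folder:
    id_path = os.path.join(folder, "ids.txt")
    written = write_id_file([3, 17, 42], id_path)
    with open(id_path) as handle:
        file_lines = handle.read().splitlines()

print(written, file_lines)
```

On Windows the path contains backslashes, which would need escaping (or forward slashes) inside the SQL string literal.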

MySQL: reorder rows from file association

A MySQL photo gallery script requires that I provide the display order of my gallery by pairing each image title to a number representing the desired order.
I have a list of correctly ordered data called pairs_list.txt that looks like this:
# title correct data in list
-- -------
1 kmmal
2 bub14
3 ili2
4 sver2
5 ell5
6 ello1
...
So, the kmmal image will be displayed first, then the bub14 image, etc.
My MySQL table called title_order has the same titles above, but they are not paired with the right numbers:
# title bad data in MySQL
-- -------
14 kmmal
100 bub14
31 ili2
47 sver2
32 ell5
1 ello1
...
How can I make a MySQL script that will look at the correct number-title pairings from pairs_list.txt and go through each row of title_order, replacing each row with the correct number? In other words, how can I make the order of the MySQL table look like that of the text file?
In pseudo-code, it might look like something like this:
Get MySQL row title
Search pairs_list.txt for this title
Get the correct number-title pair in list
Replace the MySQL number with the correct number
Repeat for all rows
Thank you for any help!
If this is not a one-time task but a function that will be called frequently, then you could use the following scenario:
Create a temp table and insert all the values from pairs_list.txt into it using MySQL's LOAD DATA INFILE.
Create a procedure (or maybe an insert trigger) on that temp table which updates your main table according to whatever was inserted.
In that procedure (or insert trigger), I would have a cursor fetching all values from the temp table and, for each value from that cursor, update the matching title in your main table.
Delete everything from the temp table.
I'd suggest you to do this simple way -
1 Remove all primary and unique keys from the title_order table, and create unique index (or primary key) on title field -
ALTER TABLE title_order
ADD UNIQUE INDEX UK_title_order_title (title);
2 Use LOAD DATA INFILE with REPLACE option to load data from the file and replace -
LOAD DATA INFILE 'pairs_list.txt'
REPLACE
INTO TABLE title_order
FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\r\n'
IGNORE 2 LINES
(@col1, @col2)
SET order_number_field = @col1, title = TRIM(@col2);
...specify properties you need in LOAD DATA INFILE command.
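As a client-side alternative to both answers, the pseudo-code from the question can be sketched in Python. The names parse_pairs and apply_pairs are hypothetical; the column name order_number_field is taken from the statement above, the two header lines are skipped just like IGNORE 2 LINES does, and the sketch assumes there is no unique key on the number column that could collide mid-update.

```python
UPDATE_SQL = "UPDATE title_order SET order_number_field = %s WHERE title = %s"


def parse_pairs(text):
    """Turn the contents of pairs_list.txt into (number, title) tuples."""
    pairs = []
    for line in text.splitlines()[2:]:  # skip the two header lines
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].isdigit():
            pairs.append((int(parts[0]), parts[1].strip()))
    return pairs


def apply_pairs(cursor, pairs):
    """Send one UPDATE per pair through a DB-API cursor."""
    cursor.executemany(UPDATE_SQL, pairs)


sample = "# title\n-- -------\n1 kmmal\n2 bub14\n3 ili2\n...\n"
pairs = parse_pairs(sample)
print(pairs)
```

Lines that do not start with a number, such as the trailing "..." in the sample, are ignored.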