We are currently importing very large CSV files into a MySQL data warehouse. A key part of the processing is to flag whether a record in the CSV file matches an existing record in the warehouse. The "match" is done by comparing specific fields in the new data against the previous version of the table. If the record is "new" or if there have been updates, we want to add it to the warehouse.
At the moment the processing plan is as follows:
~ read CSV file into mySQL table A
~ is the primary key of A already in old-A? If it isn't, set record status to "NEW"
~ if the key is in old-A, issue an UPDATE statement, JOINing old-A to A
~ if A.field1 <> old-A.field1 OR A.field2 <> old-A.field2 OR A.field3 <> old-A.field3 THEN flag record status as "UPDATE"
~ process NEW or UPDATEd records according to record status
A and old-A are currently on the order of 50M records each. We would expect new records to be around 1M and updates to be 5-10M.
Although we are currently using MySQL for this processing, I am wondering whether it would simply be better to do this using a scripting language? We are finding in particular that the step to flag the updates is very time-consuming. Essentially we have an UPDATE statement that is unable to use any index.
so
CREATE TABLE A (key1 bigint,
field1 varchar(50),
field2 varchar(50),
field3 varchar(50) );
LOAD DATA ...
... add field rec_status to table A
... then
UPDATE A
LEFT JOIN old-A ON A.key1 = old-A.key1
SET rec_status = 'NEW'
WHERE old-A.key1 IS NULL;
UPDATE A
JOIN old-A ON A.key1 = old-A.key1
SET rec_status = 'UPDATED'
WHERE A.field1 <> old-A.field1
OR A.field2 <> old-A.field2
OR A.field3 <> old-A.field3;
...
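For what it's worth, the join itself can use an index even though the field comparisons cannot: the flagging UPDATEs above probe old-A once per row of A on key1, so old-A needs to be keyed on that column. A minimal sketch, assuming key1 uniquely identifies a record in both tables (if it does not, use a plain index instead of a primary key):

-- Hypothetical: key the join column so the flagging UPDATEs can do
-- indexed lookups into old-A instead of scanning it for every row of A.
ALTER TABLE A ADD PRIMARY KEY (key1);
ALTER TABLE `old-A` ADD PRIMARY KEY (key1);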
I would consider skipping the "flag" step: process the CSV file with a script (or table A with MySQL statements), select a record from the old-A table based on whatever criteria of table A (field1 and/or field2, ...); if one is found, lock and update the old-A record and delete the processed record from the CSV or table A; if not found, create the record in old-A with the new data.
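If key1 is the primary (or a unique) key of old-A, the look-up / update / insert flow described above can also be collapsed into a single statement. A sketch of that idea, assuming it is acceptable to overwrite the fields even when they have not changed:

-- Hypothetical single-statement version: existing keys are updated in
-- place, new keys are inserted. Assumes old-A has a unique key on key1.
INSERT INTO `old-A` (key1, field1, field2, field3)
SELECT key1, field1, field2, field3 FROM A
ON DUPLICATE KEY UPDATE
    field1 = VALUES(field1),
    field2 = VALUES(field2),
    field3 = VALUES(field3);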
I want to update a MySQL table from Matlab in bulk. The current logic that I use iterates over the array and inserts the rows one-by-one, which takes way too long.
Here is my current implementation:
function update_table(customer_id_list, cluster_id_list, write_conn)
num_customers = size(customer_id_list, 1);
for idx = 1:num_customers
    customer_id = customer_id_list(idx);
    cluster_id = cluster_id_list(idx);
    % build and execute one UPDATE per row -- this is the slow part
    sql = sprintf('UPDATE table SET cluster_id = %d WHERE customer_id = %d', ...
        cluster_id, customer_id);
    exec(write_conn, sql);
end
end
Tried to look for documentation to do bulk update/insert, but haven't found anything yet.
Do an "upjoin" using a temporary table.
Build your update specification as a Matlab table array with all the cluster_id and customer_id pairs that specify the new values.
Create a SQL temporary table that contains columns for the key columns you'll be matching on and the columns to update.
CREATE TEMPORARY TABLE my_temp_table SELECT customer_id, cluster_id FROM `table` WHERE 1 = 0
Batch-insert your update specification data from Matlab into the temporary table using Matlab Database Toolbox's datainsert or sqlwrite.
Update the target table en masse by joining it to the temp table, using MySQL's multi-table UPDATE syntax: UPDATE `table` targ INNER JOIN my_temp_table upd ON targ.customer_id = upd.customer_id SET targ.cluster_id = upd.cluster_id.
Drop the temp table.
Boom. If you're going to do this a lot, wrap it up in a generic upjoin() function.
See the Matlab documentation for datainsert and sqlwrite. Do not use fastinsert; despite its name, it is much slower than datainsert and sqlwrite.
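Put together, the SQL half of the round trip looks roughly like this. It is only a sketch: `table`, my_temp_table, and the column names come from the steps above, and the batch insert in the middle is issued from Matlab via datainsert or sqlwrite.

-- 1. Empty staging table with the same column definitions as the target
CREATE TEMPORARY TABLE my_temp_table
    SELECT customer_id, cluster_id FROM `table` WHERE 1 = 0;

-- 2. (batch-insert the Matlab update specification into my_temp_table here)

-- 3. Apply every update in one join
UPDATE `table` targ
INNER JOIN my_temp_table upd ON targ.customer_id = upd.customer_id
SET targ.cluster_id = upd.cluster_id;

-- 4. Clean up
DROP TEMPORARY TABLE my_temp_table;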
Alright, I have multiple MySQL statements that lead into an issue I'm having updating a particular table. First let me show you my code, then I'll explain what I'm trying to do:
/*STEP 1 - create a temporary table to temporarily store the loaded csv*/
CREATE TEMPORARY TABLE IF NOT EXISTS `temptable1` LIKE `first60dayactivity`;
/*STEP 2. load the csv into the previously created temporary table*/
LOAD DATA LOCAL INFILE '/Users/me/Downloads/some.csv'
IGNORE INTO TABLE `temptable1`
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
SET CUSTID = 1030,
CREATED = NOW(),
isactive = 1;
/*STEP 3. update first60dayactivity table changing isactive for records that are not in the temptable*/
UPDATE `first60dayactivity` fa
INNER JOIN `temptable1` temp
ON temp.`mid` = fa.`mid`
AND temp.`primarypartnername` = fa.`primarypartnername`
AND temp.`market` = fa.`market`
AND temp.`agedays` = fa.`agedays`
AND temp.`opendate` = fa.`opendate`
AND temp.`CUSTID` = fa.`CUSTID`
SET fa.isactive = IF( temp.`mid` IS NULL, 0, 1 );
/*STEP 4. insert the temp table records into the real table*/
.....blah blah blah.....
Ok, first create a temporary table so that we have a table to hold the imported .csv data. Next, import the .csv data into the temporary table (all this works perfectly so far).
Here is where I run into an issue. I want to update the isactive column of each record in the first60dayactivity table to 0 if the record is NOT found in temptable1 (after my import). Ultimately, the .csv I'm gathering contains the new live data that should be considered "active", and I need to set the old data to inactive. So the update does an INNER JOIN to match on several columns and see whether the record is found in temptable1; if it isn't, set isactive to 0, and if it is found, ensure isactive is 1.
The problem here is that all records in first60dayactivity are retaining isactive = 1. Nothing is getting updated to 0 even though I have proof new records exist within temptable1... Can someone tell me what I'm doing wrong in my query?
Thanks in advance!
temp.mid can never be NULL because you use this column in your join condition and you use an INNER JOIN.
Run the join on its own (without the UPDATE) and it will return only the matching rows. Using a LEFT JOIN for the update should do what I think you want.
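A sketch of that LEFT JOIN version, using exactly the columns from your query; only the join type changes, and the IF(temp.mid IS NULL, 0, 1) then does what you intended:

UPDATE `first60dayactivity` fa
LEFT JOIN `temptable1` temp
       ON temp.`mid` = fa.`mid`
      AND temp.`primarypartnername` = fa.`primarypartnername`
      AND temp.`market` = fa.`market`
      AND temp.`agedays` = fa.`agedays`
      AND temp.`opendate` = fa.`opendate`
      AND temp.`CUSTID` = fa.`CUSTID`
SET fa.isactive = IF( temp.`mid` IS NULL, 0, 1 );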
I have a MySQL table like this,
id (primary key) | name | scores
and I am reading a large file to insert records into the MySQL table.
New records will be added into this file but the old records are not deleted, so when I read the file, a lot of records are already in the database.
Other than using SELECT COUNT to see whether a record is already in the database, is there a better way to check (to save processing time and database load)?
Or maybe I should just INSERT it directly? (The database will not allow records with duplicate id anyway.)
I usually use an update + insert method.
First I run the update statement; the update query acts like a select query that directly updates the data.
update t1 set t1.Name = 'Name', t1.Scores = 99
where t1.Name = 'Name' and t1.Scores = 99
then check whether a row was affected by the above query (ROW_COUNT() in MySQL); if not, run the insert statement
if ROW_COUNT() = 0
insert into t1 (Name, Scores) values ('Name', 99)
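Note that a row-count check like this only works inside a stored program (or from application code that reads the affected-row count). A minimal sketch of the same pattern as a MySQL stored procedure, with illustrative names:

-- Hypothetical procedure: update first, insert only if nothing was changed.
-- Caveat: MySQL reports 0 affected rows when the UPDATE matches a row but
-- changes nothing, so rely on a unique key to guard against duplicates.
DELIMITER //
CREATE PROCEDURE upsert_score(IN p_name VARCHAR(50), IN p_scores INT)
BEGIN
    UPDATE t1 SET Scores = p_scores WHERE Name = p_name;
    IF ROW_COUNT() = 0 THEN
        INSERT INTO t1 (Name, Scores) VALUES (p_name, p_scores);
    END IF;
END //
DELIMITER ;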
Search for examples of
INSERT IGNORE INTO table
A simple example of this is
INSERT IGNORE INTO `transcripts`
SET `ensembl_transcript_id` = 'ENSORGT00000000001',
    `transcript_chrom_start` = 12345,
    `transcript_chrom_end` = 12678;
There are approximately 26K products (posts), and each product has meta values stored as (post_id, meta_key, meta_value) rows.
The post_id column is the product id in the db, and the _sku meta_key holds the unique id for each product.
I've received a new CSV file that updates all of the values (meta_value) for _sale_price (meta_key) of each product. The CSV file looks like:
SKU, Sale Price
How do I import this CSV to update only the _sale_price row based on the post_id (product id) & _sku value?
I know how to do this in PHP by looping through the CSV and selecting & executing an update for each single product but this seems inefficient.
Preferably with phpMyAdmin and by using LOAD DATA INFILE.
You can use a temporary table to hold the update data and then run a single update statement.
CREATE TEMPORARY TABLE temp_update_table (meta_key VARCHAR(255), meta_value VARCHAR(255));

LOAD DATA INFILE 'your_csv_pathname'
INTO TABLE temp_update_table FIELDS TERMINATED BY ',' (meta_key, meta_value);

UPDATE `table`
INNER JOIN temp_update_table ON temp_update_table.meta_key = `table`.meta_key
SET `table`.meta_value = temp_update_table.meta_value;

DROP TEMPORARY TABLE temp_update_table;
If product_id is the unique column of that table, you can do that using CSV:
Have a CSV file of the rows you want to import, including their unique ID. The CSV file's columns must be in the same order as the table's columns; include all columns and no header row.
Then in phpMyAdmin, go to the table of database, click import
Select CSV in the drop-down of Format field
Make sure "Update data when duplicate keys found on import (add ON DUPLICATE KEY UPDATE)" is checked.
You can import the new data into another table (table2). Then update your primary table (table1) using an update with a sub-select:
UPDATE table1 t1 set
sale_price = (select meta_value from table2 t2 where t2.post_id = t1.product_id)
WHERE
(select count(*) from table2 t2 where t1.product_id = t2.post_id) > 0
This is obviously a simplification and you will most likely need to constrain your query a little further.
Make sure to backup your full database before attempting. I recommend you work on a non-production database until the process works flawlessly.
It seems to me that rAndom69's answer does not work on PostgreSQL 12, but the join with the WHERE does:
UPDATE tableA
SET fieldToPopulateInTableA = temp_update_table.fieldPopulated
FROM temp_update_table
WHERE tableA.correspondingField = temp_update_table.correspondingField
I have an SSIS package that copies data from table A to table B and sets a flag in table A so that the same data is not copied subsequently. This works great by using the following as the SQL command text on the ADO NET Source object:
update transfer
set ProcessDateTimeStamp = GetDate(), LastUpdatedBy = 'legacy processed'
output inserted.*
where LastUpdatedBy = 'legacy'
and ProcessDateTimeStamp is not null
The problem I have is that I need to run a similar data copy, but from two source tables joined on a primary/foreign key: select from table A joined to table B, and update the flag in table A.
I don't think I can use the technique above because I don't know where I'd put the join!
Is there another way around this problem?
Thanks
Rob.
You can use a join in an update statement.
update m
set ProcessDateTimeStamp = GetDate(),
LastUpdatedBy = 'legacy processed',
somefield = t.someotherfield
output inserted.*
from transfer t
join mytable m
on t.id = m.id
where m.LastUpdatedBy = 'legacy'
and m.ProcessDateTimeStamp is null
and t.ProcessDateTimeStamp is not null
The key is not to alias the fields on the left side of the SET but to alias everything else, and to use the table alias for the table you are updating after the UPDATE keyword so it knows which table of the join to update.