r - dbWriteTable or a MySQL Delete query? - mysql

I can't seem to find the answer to this anywhere. I am reading a CSV into a data frame using the read.csv function. Then I am writing the data frame contents to a MySQL table using dbWriteTable. This works great for the initial run to create the table, but each run after this needs to do either an insert or an update depending on whether the record already exists in the table.
The 1st column in the data frame is the primary key, and the other columns contain data that might change every time I pull a new copy of the CSV. Each time I pull the CSV, if the primary key already exists, I want it to update that record with the new data, and if the primary key does not exist (e.g. a new key since the last run), I want it to just insert the record into the table.
This is my current dbWriteTable. This creates the table just fine the 1st time it's run, and also inserts a "Timestamp" column into the table that is set to "on update CURRENT_TIMESTAMP" so that I know when each record was last updated.
dbWriteTable(mydb, value=csvData, name=Table, row.names=FALSE,
             field.types=list(PrimaryKey="VARCHAR(10)", Column2="VARCHAR(255)",
                              Column3="VARCHAR(255)", Timestamp="TIMESTAMP"),
             append=TRUE)
Now the next time I run this, I simply want it to update any PrimaryKeys that are already in the table, and add any new ones. I also don't want to lose any records in the event a PrimaryKey disappears from the CSV source.
Is it possible to do this kind of update using dbWriteTable, or some other R function?
If that's not possible, is it possible to just run a MySQL query that would delete any duplicate PrimaryKey records and keep just the one record with the most current timestamp? Then I would run dbWriteTable to append the new data, and run a MySQL query afterwards to prune out the older records.
Obviously I couldn't define that 1st column as an actual primary key in the DB, as my append/delete solution wouldn't work due to duplicate keys. That's fine; I can always add an auto-increment integer column to the table for the "real" primary key if needed.
Thoughts?

Consider using a temp table (an exact replica of the final table but with fewer records) and then running an INSERT and an UPDATE query against the final table, which will handle both cases without overlap (plus, primary keys are constraints, and the queries will error out if an attempt is made to duplicate one):
records to append if they do not exist - using the LEFT JOIN ... NULL query
records to update if they do exist - using the UPDATE ... INNER JOIN query
Concerning the former, there is a recurring debate among SQL coders over whether LEFT JOIN ... NULL, NOT IN, or NOT EXISTS is the optimal solution, which of course "depends". The LEFT JOIN used here does avoid subqueries, but consider those avenues if needed.
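For comparison, a NOT EXISTS version of the append step might look like the following sketch (raw SQL only, using the same hypothetical finalTable/tempTable names as the queries below):
INSERT INTO finalTable (PrimaryKey, Column2, Column3, Timestamp)
SELECT f.PrimaryKey, f.Column2, f.Column3, f.Timestamp
FROM tempTable f
WHERE NOT EXISTS (SELECT 1 FROM finalTable t
                  WHERE t.PrimaryKey = f.PrimaryKey);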
# DELETE LAST SET OF TEMP DATA
dbSendQuery(mydb, "DELETE FROM tempTable")
# APPEND R DATA FRAME TO TEMP DATA
dbWriteTable(mydb, value=csvData, name="tempTable", row.names=FALSE,
             field.types=list(PrimaryKey="VARCHAR(10)", Column2="VARCHAR(255)",
                              Column3="VARCHAR(255)", Timestamp="TIMESTAMP"),
             append=TRUE, overwrite=FALSE)
# LEFT JOIN ... NULL QUERY TO APPEND NEW RECORDS NOT IN TABLE
dbSendQuery(mydb, "INSERT INTO finalTable (Column1, Column2, Column3, Timestamp)
SELECT Column1, Column2, Column3, Timestamp
FROM tempTable f
LEFT JOIN finalTable t
ON f.PrimaryKey = t.PrimaryKey
WHERE f.PrimaryKey IS NULL;")
# UPDATE INNER JOIN QUERY TO UPDATE MATCHING RECORDS
dbSendQuery(mydb, "UPDATE finalTable f
INNER JOIN tempTable t
ON f.PrimaryKey = t.PrimaryKey
SET f.Column1 = t.Column1,
f.Column2 = t.Column2,
f.Column3 = t.Column3,
f.Timestamp = t.Timestamp;")
For the most part, the queries above will be portable across most SQL backends should you ever need to change databases. Some RDBMSs do not support UPDATE ... INNER JOIN, but equivalent alternatives are available. Finally, the beauty of this route is that all processing is handled in the SQL engine and not in R.
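For example, a more portable variant of the update step, replacing UPDATE ... INNER JOIN with correlated subqueries (a sketch only, same hypothetical table and column names):
UPDATE finalTable
SET Column2 = (SELECT t.Column2 FROM tempTable t
               WHERE t.PrimaryKey = finalTable.PrimaryKey),
    Column3 = (SELECT t.Column3 FROM tempTable t
               WHERE t.PrimaryKey = finalTable.PrimaryKey)
WHERE EXISTS (SELECT 1 FROM tempTable t
              WHERE t.PrimaryKey = finalTable.PrimaryKey);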

Sounds like you're trying to do an upsert.
I'm kind of rusty with MySQL but the general idea is that you need to have a staging table to upload the new CSV, and then in the database itself do the insert/update.
For that you can use dbSendQuery with INSERT ... ON DUPLICATE KEY UPDATE.
http://dev.mysql.com/doc/refman/5.7/en/insert-on-duplicate.html
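A minimal sketch of that query, assuming a hypothetical staging table named stagingTable, the column names from the question, and that PrimaryKey is declared as the primary key of finalTable:
INSERT INTO finalTable (PrimaryKey, Column2, Column3)
SELECT PrimaryKey, Column2, Column3
FROM stagingTable
ON DUPLICATE KEY UPDATE
    Column2 = VALUES(Column2),
    Column3 = VALUES(Column3);
The TIMESTAMP column with ON UPDATE CURRENT_TIMESTAMP will then refresh itself only on rows whose values actually change.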

Related

Dealing with large overlapping sets of data - updating just the delta

My Python application generates a CSV file containing a few hundred unique records, one unique row per line. It runs hourly and very often the data remains the same from one run to another. If there are changes, then they are small, e.g.
one record removed.
a few new records added.
occasional update to an existing record.
Each record is just four simple fields (name, date, id, description), and there will be no more than 10,000 records by the time the project is at maximum, so it can all be contained in a single table.
What is the best way to merge the changes into the table?
A few approaches I'm considering are:
1) empty the table and re-populate on each run.
2) write the latest data to a staging table and run a DB job to merge the changes into the main table.
3) read the existing table data into my python script, collect the new data, find the differences, run multiple 'CRUD' operations to apply the changes one by one
Can anyone suggest a better way?
thanks
I would do this in the following way:
Load the new CSV file into a second table.
DELETE rows in the main table that are missing from the second table:
DELETE m FROM main_table AS m
LEFT OUTER JOIN new_table AS t ON m.id = t.id
WHERE t.id IS NULL;
Use INSERT ON DUPLICATE KEY UPDATE to update rows that need to be updated. This becomes a no-op on each row that already contains the same values.
INSERT INTO main_table (id, name, date, description)
SELECT id, name, date, description FROM new_table
ON DUPLICATE KEY UPDATE
name = VALUES(name), date = VALUES(date), description = VALUES(description);
Drop the second table once you're done with it.
This is assuming id is the primary key and you have no other UNIQUE KEY in the table.
Given a data set size of 10,000 rows, this should be quick enough to do in one batch. Once the data set gets 10x larger, you may have to reconsider the solution, for example doing batches of 10,000 rows at a time.
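If batching ever becomes necessary, one sketch of a batched delete (same table names as above; repeat until no rows are affected) could be:
DELETE FROM main_table
WHERE id IN (
    SELECT id FROM (
        SELECT m.id
        FROM main_table AS m
        LEFT OUTER JOIN new_table AS t ON m.id = t.id
        WHERE t.id IS NULL
        LIMIT 10000
    ) AS batch
);
The extra derived table is there because MySQL otherwise refuses a LIMIT in the IN subquery and a delete that selects from the table being deleted from.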

Reading incremental data based on the value of composite Primary key

I have an OLTP (source) from which data has to be moved to the DWH (destination) on an incremental basis.
The source table has a composite Primary Key on Loan_id, AssetID as shown below.
LOAN_ID, ASSETID, REC_STATUS
'12848','13170', 'F'
Had it been a single-column primary key, I would check for the max value of the column at the destination and then read all the records from the source where the primary key value is greater than the max value at the destination. But as it is a composite primary key, this will not work.
Any idea how this can be done using T-SQL Query?
Specs: the source is a MySQL DB and the destination is MSSQL 2012. The connection is made using a linked server.
There are a few things you can try. Dealing with a linked server, and not knowing the specifics of that setup or the volume of data, performance could be an issue.
If you're not worried about changes in existing records or deletes, a simple left outer join will get you any records that haven't been inserted into your destination yet:
SELECT [s].[LOAN_ID]
, [s].[ASSETID]
, [s].[REC_STATUS]
FROM [LinkedServer].[Database].[schema].[SourceTable] [s]
LEFT OUTER JOIN [DestinationTable] [d]
ON [s].[LOAN_ID] = [d].[LOAN_ID]
AND [s].[ASSETID] = [d].[ASSETID]
WHERE [d].[LOAN_ID] IS NULL;
If you're worried about changes, you could still use a left outer join and look for NULLs in the destination or for differences in field values, but then you'd need an additional update statement (a sketch of that follows the query below).
SELECT [s].[LOAN_ID]
, [s].[ASSETID]
, [s].[REC_STATUS]
FROM [LinkedServer].[Database].[schema].[SourceTable] [s]
LEFT OUTER JOIN [DestinationTable] [d]
ON [s].[LOAN_ID] = [d].[LOAN_ID]
AND [s].[ASSETID] = [d].[ASSETID]
WHERE [d].[LOAN_ID] IS NULL --Records from source not in destination
OR (
--This evaluates those in the destination, but then checks for changes in field values.
[d].[LOAN_ID] IS NOT NULL
AND (
[s].[REC_STATUS] <> [d].[REC_STATUS]
OR [s].[SomeOtherField] <> [d].[SomeOtherField]
)
);
-- Insert the results above into a landing or staging table on the destination side, and then you could do a MERGE.
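The additional update for rows that already exist but have changed might then look something like this (a sketch, assuming the changed rows above were landed in a hypothetical [StagingTable] on the destination first):
UPDATE [d]
SET [d].[REC_STATUS] = [s].[REC_STATUS]
FROM [DestinationTable] [d]
INNER JOIN [StagingTable] [s]
    ON [s].[LOAN_ID] = [d].[LOAN_ID]
    AND [s].[ASSETID] = [d].[ASSETID]
WHERE [s].[REC_STATUS] <> [d].[REC_STATUS];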
If you need to worry about deletes (a record was deleted from the source and you don't want it in the destination anymore), flip the left outer join to find records in your destination that are no longer in the source:
DELETE [d]
FROM [DestinationTable] [d]
LEFT OUTER JOIN [LinkedServer].[Database].[schema].[SourceTable] [s]
ON [s].[LOAN_ID] = [d].[LOAN_ID]
AND [s].[ASSETID] = [d].[ASSETID]
WHERE [s].[LOAN_ID] IS NULL;
You could attempt to do all of this using a MERGE. Try the MERGE over the linked server, or bring all the source records to the destination into a landing/staging table and do the merge there. Here's an example of attempting it over the linked server.
MERGE [DestinationTable] [d]
USING [LinkedServer].[Database].[schema].[SourceTable] [s]
ON [s].[LOAN_ID] = [d].[LOAN_ID]
AND [s].[ASSETID] = [d].[ASSETID]
WHEN MATCHED THEN UPDATE SET [REC_STATUS] = [s].[REC_STATUS]
WHEN NOT MATCHED BY TARGET THEN INSERT (
[LOAN_ID]
, [ASSETID]
, [REC_STATUS]
)
VALUES ( [s].[LOAN_ID], [s].[ASSETID], [s].[REC_STATUS] )
WHEN NOT MATCHED BY SOURCE THEN DELETE;
When dealing with a merge, you've got to watch out for this statement:
WHEN NOT MATCHED BY SOURCE THEN DELETE;
If you're not working with the entire record set, you could lose records in your destination. For example: you've limited the result set you pulled from the source into a staging table, you now merge the staging table with the final destination, and anything outside of that gets deleted in your destination. You can solve that by limiting your target with a CTE (Google: "merge into cte as the target"); that's if you have a date you can filter on.
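A rough sketch of that CTE pattern (hypothetical [StagingTable] and [UpdateDate] filter column; adjust to whatever date column you actually have):
WITH [target] AS (
    SELECT [LOAN_ID], [ASSETID], [REC_STATUS]
    FROM [DestinationTable]
    WHERE [UpdateDate] >= DATEADD(DAY, -1, GETDATE()) -- hypothetical filter column
)
MERGE [target] AS [t]
USING [StagingTable] AS [s]
    ON [s].[LOAN_ID] = [t].[LOAN_ID]
    AND [s].[ASSETID] = [t].[ASSETID]
WHEN MATCHED THEN UPDATE SET [REC_STATUS] = [s].[REC_STATUS]
WHEN NOT MATCHED BY SOURCE THEN DELETE;
The DELETE now only considers rows inside the CTE's window rather than the whole destination table.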
If you have a date column, that's always helpful, especially some sort of Change/Update date column when new records are inserted or updated. Then you can filter on your source to only those records you care about.
Incremental loads typically have a date driving them.
You can use a composite key inside a lookup. This has been answered many times.
Add a lookup and change the no-match behavior to redirect rows with no match (the default is to fail).
Basically you check whether the key exists in the destination.
If the key exists, then it is an update (match).
If the key does not exist (no match), then it is an insert.

Deleting non referenced data from database

Because of a bad design, I have to clean up a database. There's data in there which is not "connected" correctly (foreign keys were not set).
Therefore I want to delete all data which is not referenced.
Using a join, I have created a temporary table temp1 and inserted into it all the Entity_ID values which have no connection to the main table entity. The next step is that I want to delete all the bad data from entityisactive with the following query:
Delete from db1.entityisactive where db1.entityisactive.Entity_ID IN
(
Select db1.temp1.Entity_ID from db1.temp1
)
The problem is, I get a connection timeout, even when I just do Select db1.temp1.Entity_ID from db1.temp1 where Entity_ID = 42
What I want to do is delete all entries in entityisactive where entityisactive.Entity_ID = temp1.Entity_ID
How can I speed up the SQL query? Or where is my error in reasoning?
I would suggest using an explicit join and then defining indexes. The code would look like:
Delete eia
from db1.entityisactive eia join
db1.temp1 t
on eia.Entity_ID = t.Entity_ID ;
Then, this query wants an index on either temp1(Entity_ID) or entityisactive(Entity_ID). Both indexes are not necessary; the second will probably give the better performance.
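For example, the second one might be created like this (hypothetical index name):
CREATE INDEX idx_entityisactive_entity_id
    ON db1.entityisactive (Entity_ID);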

MySQL UPDATE table 1 and INSERT on table2 if id doesnt exist

I have a left join query that shows all the fields from a primary table (tblMarkers) and the values from a second table (tblLocations) where there is a matching record.
tblLocations does not have a record for every ID in tblMarkers.
$query ="SELECT `tblMarkers`.*,`tblLocation`.*,`tblLocation`.`ID` AS `markerID`
FROM
`tblMarkers`
LEFT JOIN `tblLocation` ON `tblMarkers`.`ID` = `tblLocation`.`ID`
WHERE
`tblMarkers`.`ID` = $id";
I am comfortable with using UPDATE to update the tblMarkers fields, but how do I update or INSERT a record into tblLocations if the record does not exist yet in tblLocations?
Also, how do I lock the record I am working on to prevent someone else from doing an update at the same time?
Can I also use UPDATE tblMarkers * or do I have to list every field in the UPDATE statement?
Unfortunately you might have to implement some validation in your outside script. There is an IF statement in SQL, but I'm not sure if you can trigger different commands based on its outcome.
Locking
In terms of locking, you have two options. For MyISAM tables, you can only lock the entire table; see http://dev.mysql.com/doc/refman/5.0/en/lock-tables.html
LOCK TABLES users WRITE;
For InnoDB tables, there is no explicit 'lock' for single rows; however, you can use transactions to get exclusive rights during the operation. See http://dev.mysql.com/doc/refman/5.0/en/innodb-locks-set.html
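A minimal sketch of that approach (assuming an InnoDB table and a hypothetical ID value of 42 in place of $id):
START TRANSACTION;
-- lock the row so concurrent writers block until we commit
SELECT * FROM tblMarkers WHERE ID = 42 FOR UPDATE;
-- ... run the UPDATE / INSERT statements here ...
COMMIT;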
Update
There might be some shorthand notation, but I think you have to list every field in your query. Alternatively, you can always read the entire row, delete it, and insert it again using the shorthand INSERT query. It all depends on how many fields you've got.

Importing MySQL records with duplicate keys

I have two MySQL databases with identical table structure, each populated with several thousand records. I need to merge the two into a single database but I can't import one into the other because of duplicate IDs. It's a relational database with many linked tables (fields point to other table record IDs).
Edit: The goal is to have data from both databases in one final database, without overwriting data, and updating foreign keys to match with new record IDs.
I'm not sure how to go about merging the two databases. I could write a script I suppose, but there's many tables and it would take a while to do. I wondered if anyone else had encountered this problem, and the best way to go about it?
Just ignore the duplicates. The first time the key is inserted, it'll be inserted. The second time it will be ignored.
INSERT IGNORE INTO myTable (SELECT * FROM myOtherTable);
See the MySQL manual for the full INSERT syntax.
The trick was to increment the IDs in one database by 1000 (or whatever amount ensures they won't overlap the data in the target database), then import it.
Thanks for everyone's answers.
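For reference, that increment step might look something like this per table (hypothetical parent/child table and column names; every foreign key column pointing at the shifted IDs has to be shifted too):
UPDATE db2.parent_table SET id = id + 1000;
UPDATE db2.child_table  SET parent_id = parent_id + 1000;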
Are the duplicate IDs supposed to correspond to each other? You could create a new table with an auto increment field and save the existing keys as two columns.
That would just be a 'bulk copy' though. If there is some underlying relationship then that would dictate how to combine the data.
If you have two tables A1 and A2 and you want to merge this to AA you can do this:
INSERT INTO aa SELECT * FROM A1;
INSERT INTO aa SELECT * FROM A2
ON DUPLICATE KEY UPDATE
    nonkeyfield1 = A2.nonkeyfield1,
    nonkeyfield2 = A2.nonkeyfield2, ...;
This will overwrite fields with duplicate keys with A2 data.
A slightly slower method with simpler syntax is:
INSERT INTO aa SELECT * FROM A1;
REPLACE INTO aa SELECT * FROM A2;
This will do much the same thing, but instead of updating duplicate rows it will delete the row that came from A1 first and then insert the row from A2.
If you want to merge a whole database with foreign keys, this will not work, because it will break the links between tables.
If you have a whole database and you do not want to overwrite data:
Import the first database as normal into database A.
Import the second database into a database B.
Set all foreign keys as on update cascade.
Double check this.
Now run the following statements on all tables in database B.
SELECT @increment := MAX(pk) FROM A.table1;
UPDATE B.table1 SET pk = pk + @increment WHERE pk IS NOT NULL
ORDER BY pk DESC;
(The where clause is to stop MySQL from giving an error in strict mode)
If you write a script with those two lines per table in your database, you can then insert all tables into database AA; remember to disable foreign key checks during the update with
SET foreign_key_checks = 0;
... do lots of inserts ...
SET foreign_key_checks = 1;
Good luck.
Create a new database table with an auto-incremented primary key as the first column. Then add the column names from your databases and import each one. Finally, drop the old primary key field and rename the new one to match your primary key name.
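A rough sketch of that approach, with hypothetical table and column names:
CREATE TABLE merged (
    new_id      INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    old_id      INT,
    name        VARCHAR(255),
    description VARCHAR(255)
);
INSERT INTO merged (old_id, name, description)
    SELECT id, name, description FROM db1.source_table;
INSERT INTO merged (old_id, name, description)
    SELECT id, name, description FROM db2.source_table;
-- once everything is imported, drop the old key and rename the new one
ALTER TABLE merged DROP COLUMN old_id;
ALTER TABLE merged CHANGE new_id id INT NOT NULL AUTO_INCREMENT;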